The following analysis is conducted on data describing the production of steel sheets and the presence of surface errors of classes 4, 14, and 15. The data used in this analysis were reduced from many larger original sets. Most dropped variables were removed based on the analysis in steps 1-5 of the data reduction and treatment stage; some were removed at the instruction of others, and these are noted alongside the corresponding variable removals. This and all earlier analysis was done in an attempt both to reproduce the results found by Prof. Wilhelm et al. and to improve on the previous analysis in the original “ProtMod” file.
During these steps, almost all data files were successfully reproduced; the exceptions are one file in step 3 and the original identifying index matrix. With regard to the index matrix, all results that could be reproduced were reproduced accurately, but three files in the final workspace data set are not shown to be created in the original code file. Without the information required to recreate these files, observations could be identified incorrectly in future analysis, so the original identifying matrix produced by Prof. Wilhelm was used throughout my analysis. Conversely, the data created in step 3 that did not match the previously produced data were used in my analysis in place of the original data file produced by Prof. Wilhelm. These files differed by approx. 3,700 observations out of a total of approx. 33,800.
rm(list=ls())
library(readxl)
library(ggplot2)
library(plyr)
library(data.table)
library(tidyverse)
library(randomForest) #for random forests
library(caret) # for CV folds and data splitting
library(GGally)
library(MASS)
library(car)
library(party)
library(partykit)
library(xtable)
library(knitr)
library(kableExtra)
library(summarytools)
library(gridExtra)
load("../anna_data/anna_merged_data_1.Rdata")
#Lists of Var type by name - Index variables, Numeric variables, and Categorical variables
var.index <- c("MAT_IDENT", "lTileID", "CoilID")
var.num <- colnames(df[,sapply(df,is.numeric)])
var.factor <- c("VORG_HAUPTAGGREGAT.x", "BPW_ERZEUGUNG", "FLAEMMGRAD_IST", "TAUCHAUSGUSS")
#Reordering by index variables
df <- df %>%
  dplyr::select(all_of(var.index), everything())
#Variable Reduction
#Dropping sf, RIEGELLAENGE.max.slab, & lTile_length
df <- df %>%
  dplyr::select(-c(sf, RIEGELLAENGE.max.slab, lTile_length))
#Send to long format; review missing patterns
df_long <- df %>%
  tidyr::gather(key=length_attr, value=measurement, -c(MAT_IDENT, lTileID, CoilID))
Computed below, by variable, are the number of observations (“count”), the number of unique values (“unique”), the number of missing values (“na”), and the number of non-missing entries (“N”).
| length_attr | count | unique | na | N |
|---|---|---|---|---|
| ANST__VS_HG_3__IR__S | 26882 | 621 | 10513 | 16369 |
| ANST__VS_SP_3__IR__S | 26882 | 543 | 10513 | 16369 |
| ARGON_DRUCK_ST | 26882 | 426 | 1718 | 25164 |
| ARGON_DURCHFL_DUSCH | 26882 | 3146 | 1718 | 25164 |
| ARGON_DURCHFL_ST | 26882 | 897 | 1718 | 25164 |
| CHARGEN_NR | 26882 | 249 | 0 | 26882 |
| Class.14 | 26882 | 17 | 0 | 26882 |
| Class.15 | 26882 | 20 | 0 | 26882 |
| Class.4 | 26882 | 27 | 0 | 26882 |
| DICKE__AL__IR__S | 26882 | 16028 | 10513 | 16369 |
| DICKE__FB__IR__S | 26882 | 16028 | 10513 | 16369 |
| DICKE__HA_1__IR__S | 26882 | 16086 | 10513 | 16369 |
| DICKE__HA_2__IR__S | 26882 | 15569 | 10513 | 16369 |
| DICKE__VB__IR__S | 26882 | 596 | 10513 | 16369 |
| DT_FS | 26882 | 1860 | 1718 | 25164 |
| DT_LS | 26882 | 1952 | 1718 | 25164 |
| DT_SSL | 26882 | 1753 | 1718 | 25164 |
| DT_SSR | 26882 | 1613 | 1718 | 25164 |
| ENTZ__FS_ZW_F1__IR__S | 26882 | 89 | 10513 | 16369 |
| ENTZ__FS_ZW_F2__IR__S | 26882 | 96 | 10513 | 16369 |
| ENTZ__ZW_OF_AL__IR__S | 26882 | 4 | 10513 | 16369 |
| ENTZ__ZW2_AL__IR__S | 26882 | 61 | 10513 | 16369 |
| ENTZ__ZWR1_AL_SN2__IR__S | 26882 | 23 | 10513 | 16369 |
| ENTZ__ZWR1_EL_SN1__IR__S | 26882 | 25 | 10513 | 16369 |
| ENTZ__ZWR1_EL_SN3__IR__S | 26882 | 2 | 10513 | 16369 |
| FUELLSTAND | 26882 | 97 | 1718 | 25164 |
| FUELLSTAND_VHUZ | 26882 | 97 | 1718 | 25164 |
| KEIL25__FB__IR__S | 26882 | 15190 | 10513 | 16369 |
| KEIL40__FB__IR__S | 26882 | 15755 | 10513 | 16369 |
| KEIL50__FB__IR__S | 26882 | 15190 | 10513 | 16369 |
| KONI_LINKS | 26882 | 174 | 1718 | 25164 |
| KONI_RECHTS | 26882 | 181 | 1718 | 25164 |
| Length.max.slab | 26882 | 28 | 4034 | 22848 |
| NETTO_PFANNENINHALT | 26882 | 11689 | 1718 | 25164 |
| PLATTENDICKE_SSL | 26882 | 19 | 1718 | 25164 |
| PLATTENDICKE_SSR | 26882 | 19 | 1718 | 25164 |
| POSITION_X.x | 26882 | 25118 | 1718 | 25164 |
| POSITION_X.y | 26882 | 17482 | 4034 | 22848 |
| PR_40__FB__IR__S | 26882 | 15842 | 10513 | 16369 |
| RIEGELLAENGE | 26882 | 14346 | 1718 | 25164 |
| RISS__HA_AS__IR__S | 26882 | 4 | 10513 | 16369 |
| RISS__HA_BS__IR__S | 26882 | 9 | 10513 | 16369 |
| STOPFENSTELLUNG | 26882 | 262 | 1718 | 25164 |
| STRANGBREITE | 26882 | 118 | 1718 | 25164 |
| STRANGNUMMER | 26882 | 3 | 1718 | 25164 |
| TEMP__FB__IR__S | 26882 | 16109 | 10513 | 16369 |
| TEMP__FB_1__IR__S | 26882 | 16109 | 10513 | 16369 |
| TEMP__FB_2__IR__S | 26882 | 16109 | 10513 | 16369 |
| TEMP__FB_3__IR__S | 26882 | 16111 | 10513 | 16369 |
| TEMP__HA__IR__S | 26882 | 16109 | 10513 | 16369 |
| TEMP__HA__OS__IR__S | 26882 | 16112 | 10513 | 16369 |
| TEMP__HA__SR__MAX | 26882 | 127 | 10513 | 16369 |
| TEMP__HA__SR__MIN | 26882 | 128 | 10513 | 16369 |
| TEMP__HA__SR__S | 26882 | 127 | 10513 | 16369 |
| TEMP__HA_1__IR__S | 26882 | 16112 | 10513 | 16369 |
| TEMP__HA_2__IR__S | 26882 | 16112 | 10513 | 16369 |
| TEMP__HA_4__IR__S | 26882 | 16118 | 10513 | 16369 |
| TEMP__HA_5__IR__S | 26882 | 16099 | 10513 | 16369 |
| TEMP__VB__IR__S | 26882 | 632 | 10513 | 16369 |
| TEMP__VB_1__IR__S | 26882 | 632 | 10513 | 16369 |
| TEMP__VB_2__IR__S | 26882 | 632 | 10513 | 16369 |
| TEMP__VB_3__IR__S | 26882 | 632 | 10513 | 16369 |
| TEMP__VB_4__IR__S | 26882 | 632 | 10513 | 16369 |
| TEMP__VB_5__IR__S | 26882 | 651 | 10513 | 16369 |
| TM_FS_M | 26882 | 3979 | 1718 | 25164 |
| TM_FS_SSL | 26882 | 3763 | 1718 | 25164 |
| TM_FS_SSR | 26882 | 3874 | 1718 | 25164 |
| TM_LS_M | 26882 | 3614 | 1718 | 25164 |
| TM_LS_SSL | 26882 | 3636 | 1718 | 25164 |
| TM_LS_SSR | 26882 | 3879 | 1718 | 25164 |
| TM_SSL_FS | 26882 | 4216 | 1718 | 25164 |
| TM_SSL_LS | 26882 | 3632 | 1718 | 25164 |
| TM_SSR_FS | 26882 | 3894 | 1718 | 25164 |
| TM_SSR_LS | 26882 | 3471 | 1718 | 25164 |
| TO_FS_M | 26882 | 4911 | 1718 | 25164 |
| TO_FS_SSL | 26882 | 5124 | 2620 | 24262 |
| TO_FS_SSR | 26882 | 4557 | 1718 | 25164 |
| TO_LS_M | 26882 | 4319 | 1718 | 25164 |
| TO_LS_SSL | 26882 | 4276 | 1718 | 25164 |
| TO_LS_SSR | 26882 | 4372 | 1718 | 25164 |
| TO_SSL_FS | 26882 | 4991 | 1718 | 25164 |
| TO_SSL_LS | 26882 | 4879 | 1718 | 25164 |
| TO_SSR_FS | 26882 | 5008 | 1718 | 25164 |
| TO_SSR_LS | 26882 | 4851 | 1718 | 25164 |
| TU_FS_M | 26882 | 2938 | 2298 | 24584 |
| TU_FS_SSL | 26882 | 3048 | 1718 | 25164 |
| TU_FS_SSR | 26882 | 2948 | 1718 | 25164 |
| TU_LS_M | 26882 | 2815 | 1959 | 24923 |
| TU_LS_SSL | 26882 | 2945 | 2736 | 24146 |
| TU_LS_SSR | 26882 | 2976 | 1959 | 24923 |
| TU_SSL_FS | 26882 | 3365 | 1718 | 25164 |
| TU_SSL_LS | 26882 | 3206 | 1718 | 25164 |
| TU_SSR_FS | 26882 | 3205 | 1718 | 25164 |
| TU_SSR_LS | 26882 | 3189 | 1718 | 25164 |
| TUNDISH_POSITION | 26882 | 32 | 1718 | 25164 |
| V__FS_G1__IR__S | 26882 | 1296 | 10513 | 16369 |
| V__FS_G2__IR__S | 26882 | 2356 | 10513 | 16369 |
| V__FS_G3__IR__S | 26882 | 4205 | 10513 | 16369 |
| V__FS_G4__IR__S | 26882 | 6762 | 10513 | 16369 |
| V__FS_G5__IR__S | 26882 | 10559 | 10513 | 16369 |
| V__FS_G6__IR__S | 26882 | 13621 | 10513 | 16369 |
| V__FS_G7__IR__S | 26882 | 16100 | 10513 | 16369 |
| VERTEILERFUELLSTAND | 26882 | 677 | 1718 | 25164 |
| VG | 26882 | 571 | 1718 | 25164 |
| VORBRAMME | 26882 | 22 | 0 | 26882 |
| VORG_HAUPTAGGREGAT | 26882 | 3 | 1718 | 25164 |
| WASSER_FS | 26882 | 1091 | 1718 | 25164 |
| WASSER_LS | 26882 | 859 | 1718 | 25164 |
| WASSER_SSL | 26882 | 302 | 1718 | 25164 |
| WASSER_SSR | 26882 | 454 | 1718 | 25164 |
| WK__FS_G1__IR__S | 26882 | 1296 | 10513 | 16369 |
| WK__FS_G2__IR__S | 26882 | 2356 | 10513 | 16369 |
| WK__FS_G3__IR__S | 26882 | 4205 | 10513 | 16369 |
| WK__FS_G4__IR__S | 26882 | 6762 | 10513 | 16369 |
| WK__FS_G5__IR__S | 26882 | 10559 | 10513 | 16369 |
| WK__FS_G6__IR__S | 26882 | 13621 | 10513 | 16369 |
| WK__FS_G7__IR__S | 26882 | 16169 | 10513 | 16369 |
| WK__VS_HG_3__IR__S | 26882 | 621 | 10513 | 16369 |
| WK__VS_SP_3__IR__S | 26882 | 396 | 10513 | 16369 |
| WSPALT__FS_G1__IR__S | 26882 | 1296 | 10513 | 16369 |
| WSPALT__FS_G2__IR__S | 26882 | 2356 | 10513 | 16369 |
| WSPALT__FS_G3__IR__S | 26882 | 4211 | 10513 | 16369 |
| WSPALT__FS_G4__IR__S | 26882 | 6780 | 10513 | 16369 |
| WSPALT__FS_G5__IR__S | 26882 | 10604 | 10513 | 16369 |
| WSPALT__FS_G6__IR__S | 26882 | 13687 | 10513 | 16369 |
| WSPALT__FS_G7__IR__S | 26882 | 16169 | 10513 | 16369 |
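The chunk that builds the summary table is not shown in the report. A minimal sketch of how such a per-variable summary could be computed from long-format data is given here in base R. The toy data, the helper `describe_var`, and the object `toy_desc` are illustrative assumptions and not part of the original code; in particular, counting NA as its own unique value is an assumption made to match how the table above appears to behave.

```r
# Toy long-format data standing in for df_long (values are made up)
toy_long <- data.frame(
  length_attr = c("a", "a", "a", "b", "b", "b"),
  measurement = c(1, 1, NA, 2, 3, 4)
)

# One row of summary statistics for a single variable
describe_var <- function(x) {
  data.frame(
    count  = length(x),          # all rows, missing or not
    unique = length(unique(x)),  # assumption: NA counts as its own value
    na     = sum(is.na(x)),      # missing entries
    N      = sum(!is.na(x))      # non-missing entries
  )
}

# Apply the helper to each variable and stack the results
pieces <- lapply(split(toy_long$measurement, toy_long$length_attr), describe_var)
toy_desc <- cbind(length_attr = names(pieces), do.call(rbind, pieces))
rownames(toy_desc) <- NULL
print(toy_desc)
```

In this toy example, variable `a` has a single observed value but a unique count of 2 because of its missing entry, so a filter on `unique == 1` would not flag it.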
With the above table we can review the number of missing values in each variable, check whether the missing values follow any pattern, and look for constant variables, that is, variables without variation. We drop all constant variables: because they do not vary, they cannot help explain the variation in other variables.
#Drop constant vars
var.all.const <- df_desc %>%
  dplyr::filter(unique == 1) %>%
  dplyr::select(length_attr) %>%
  unlist() %>%
  as.character()
length(var.all.const)
## [1] 0
The length of var.all.const is 0, so no variable has exactly one unique value and there is nothing to drop here.
With regard to the missing values, the table shows a rather obvious pattern: almost all variables have either 10513, 1718, or no missing values, although certain variables, which will be listed below, do not follow this pattern. As our data is identified across multiple levels of specificity (i.e. slab, tile, etc.), this implies that certain groups of variables are collected only within certain scopes. Since variables within such a group may be collected concurrently, they should be reviewed for correlation.
Next, we separate the variables by data type: integer (int), categorical (factor), and numeric. The data frame is then split on those types.
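One way to make the missing-value pattern explicit is to group variables by their number of missing values and then check whether the variables in a group are missing on the same rows. The following base R sketch uses a made-up data frame; the column names and the helper `same_rows` are hypothetical and only illustrate the idea.

```r
# Toy wide data frame; column names and values are illustrative only
toy <- data.frame(
  id    = 1:6,
  slab1 = c(1, 2, NA, 4, 5, 6),
  slab2 = c(6, 5, NA, 3, 2, 1),
  coil1 = c(NA, NA, NA, 1, 2, 3),
  coil2 = c(NA, NA, NA, 9, 8, 7)
)

na_counts <- colSums(is.na(toy))                 # missing values per column
na_groups <- split(names(na_counts), na_counts)  # columns sharing the same count
print(na_groups)

# TRUE if all columns in a group are missing on exactly the same rows
same_rows <- function(cols) {
  m <- is.na(toy[, cols, drop = FALSE])
  all(m == m[, 1])
}
print(sapply(na_groups, same_rows))
```

Groups whose members are missing on identical rows are plausible candidates for having been collected together within one scope.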
#var lists by type
var.list <- colnames(df) #names of all variables
var.int <- colnames(df[,sapply(df,is.integer)]) #integer variables
var.factor <- colnames(df[,sapply(df,is.factor)]) #categorical vars
var.num <- colnames(df[,sapply(df,is.numeric)]) #numeric vars
#splitting DF by type
df_factor <- df %>%
  dplyr::select(all_of(var.index), all_of(var.factor))
df_num <- df %>%
  dplyr::select(all_of(var.num), -all_of(var.int))
df_num_long <- df_num %>%
  gather(key=slab_attr, value=measurement)
#descriptives here are grouped over the entire df
| slab_attr | count | mean | sd | min | max | unique | na | N | max_freq | min_freq | freq_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ANST__VS_HG_3__IR__S | 26882 | 4.564310e-02 | 2.360949e-01 | 0.000000e+00 | 2.048611e+00 | 621 | 10513 | 16369 | 15750 | 1 | 25.3607085 |
| ANST__VS_SP_3__IR__S | 26882 | 1.861879e+00 | 9.709546e+00 | 0.000000e+00 | 8.524173e+01 | 543 | 10513 | 16369 | 15760 | 1 | 29.0220994 |
| ARGON_DRUCK_ST | 26882 | 5.986625e+01 | 1.870098e+01 | 1.800000e+01 | 1.000000e+02 | 426 | 1718 | 25164 | 1076 | 1 | 2.5234742 |
| ARGON_DURCHFL_DUSCH | 26882 | 1.152141e+02 | 3.499762e+01 | 5.200000e-01 | 1.900000e+02 | 4033 | 1718 | 25164 | 2822 | 1 | 0.6994793 |
| ARGON_DURCHFL_ST | 26882 | 8.326040e+00 | 5.982072e-01 | 3.410000e+00 | 1.108000e+01 | 1050 | 1718 | 25164 | 1952 | 1 | 1.8580952 |
| CHARGEN_NR | 26882 | 4.490450e+05 | 2.574589e+05 | 1.631710e+05 | 7.226910e+05 | 249 | 0 | 26882 | 1134 | 1 | 4.5502008 |
| Class.14 | 26882 | 2.823450e-02 | 4.637674e-01 | 0.000000e+00 | 3.700000e+01 | 17 | 0 | 26882 | 26664 | 1 | 1568.4117647 |
| Class.15 | 26882 | 2.652330e-02 | 4.595173e-01 | 0.000000e+00 | 2.800000e+01 | 20 | 0 | 26882 | 26684 | 1 | 1334.1500000 |
| Class.4 | 26882 | 4.746671e-01 | 1.265641e+00 | 0.000000e+00 | 3.500000e+01 | 27 | 0 | 26882 | 20569 | 1 | 761.7777778 |
| CoilID | 26882 | 1.914100e+07 | 3.678003e+05 | 1.865870e+07 | 2.002710e+07 | 657 | 0 | 26882 | 321 | 1 | 0.4870624 |
| DICKE__AL__IR__S | 26882 | 1.511021e-01 | 4.236360e-02 | 0.000000e+00 | 6.130659e-01 | 16028 | 10513 | 16369 | 343 | 1 | 0.0213377 |
| DICKE__FB__IR__S | 26882 | 1.511021e-01 | 4.236360e-02 | 0.000000e+00 | 6.130659e-01 | 16028 | 10513 | 16369 | 343 | 1 | 0.0213377 |
| DICKE__HA_1__IR__S | 26882 | 1.516607e-01 | 4.182080e-02 | 0.000000e+00 | 6.130659e-01 | 16086 | 10513 | 16369 | 285 | 1 | 0.0176551 |
| DICKE__HA_2__IR__S | 26882 | 1.459174e-01 | 4.722860e-02 | 0.000000e+00 | 4.192426e-01 | 15569 | 10513 | 16369 | 802 | 1 | 0.0514484 |
| DICKE__VB__IR__S | 26882 | 4.809820e-02 | 2.543457e-01 | 0.000000e+00 | 2.269721e+00 | 596 | 10513 | 16369 | 15775 | 1 | 26.4664430 |
| DT_FS | 26882 | 6.856955e+01 | 4.827426e+00 | 5.070000e+01 | 8.510000e+01 | 2260 | 1718 | 25164 | 533 | 1 | 0.2353982 |
| DT_LS | 26882 | 7.040780e+01 | 5.088746e+00 | 5.050000e+01 | 8.580000e+01 | 2415 | 1718 | 25164 | 459 | 1 | 0.1896480 |
| DT_SSL | 26882 | 5.877077e+01 | 4.419236e+00 | 4.358000e+01 | 6.970000e+01 | 2175 | 1718 | 25164 | 421 | 1 | 0.1931034 |
| DT_SSR | 26882 | 5.354157e+01 | 3.564115e+00 | 4.070000e+01 | 6.600000e+01 | 2055 | 1718 | 25164 | 486 | 1 | 0.2360097 |
| ENTZ__FS_ZW_F1__IR__S | 26882 | 3.392100e-03 | 1.191530e-02 | 0.000000e+00 | 8.333330e-02 | 89 | 10513 | 16369 | 15071 | 1 | 169.3258427 |
| ENTZ__FS_ZW_F2__IR__S | 26882 | 5.956500e-03 | 1.562020e-02 | 0.000000e+00 | 8.333330e-02 | 96 | 10513 | 16369 | 14176 | 1 | 147.6562500 |
| ENTZ__ZW_OF_AL__IR__S | 26882 | 3.000000e-06 | 2.196000e-04 | 0.000000e+00 | 1.639340e-02 | 4 | 10513 | 16369 | 16366 | 1 | 4091.2500000 |
| ENTZ__ZW2_AL__IR__S | 26882 | 1.527500e-03 | 7.791700e-03 | 0.000000e+00 | 6.666670e-02 | 61 | 10513 | 16369 | 15736 | 1 | 257.9508197 |
| ENTZ__ZWR1_AL_SN2__IR__S | 26882 | 1.111000e-04 | 2.261600e-03 | 0.000000e+00 | 6.666670e-02 | 23 | 10513 | 16369 | 16328 | 1 | 709.8695652 |
| ENTZ__ZWR1_EL_SN1__IR__S | 26882 | 1.006000e-04 | 2.112400e-03 | 0.000000e+00 | 6.666670e-02 | 25 | 10513 | 16369 | 16330 | 1 | 653.1600000 |
| ENTZ__ZWR1_EL_SN3__IR__S | 26882 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2 | 10513 | 16369 | 16369 | 16369 | 0.0000000 |
| FUELLSTAND | 26882 | 7.497974e+01 | 6.737898e-01 | 7.200000e+01 | 7.850000e+01 | 97 | 1718 | 25164 | 12711 | 1 | 131.0309278 |
| FUELLSTAND_VHUZ | 26882 | 7.497974e+01 | 6.737898e-01 | 7.200000e+01 | 7.850000e+01 | 97 | 1718 | 25164 | 12711 | 1 | 131.0309278 |
| KEIL25__FB__IR__S | 26882 | -2.246960e-02 | 8.339448e-01 | -4.823868e+00 | 3.480015e+00 | 15190 | 10513 | 16369 | 1181 | 1 | 0.0776827 |
| KEIL40__FB__IR__S | 26882 | -1.268257e-01 | 7.662683e-01 | -4.087126e+00 | 3.306016e+00 | 15755 | 10513 | 16369 | 616 | 1 | 0.0390352 |
| KEIL50__FB__IR__S | 26882 | -4.375930e-02 | 8.002416e-01 | -4.844529e+00 | 3.428814e+00 | 15190 | 10513 | 16369 | 1181 | 1 | 0.0776827 |
| KONI_LINKS | 26882 | 1.119920e+01 | 6.437812e-01 | 3.700000e+00 | 1.867500e+01 | 204 | 1718 | 25164 | 10203 | 1 | 50.0098039 |
| KONI_RECHTS | 26882 | 1.120665e+01 | 6.602210e-01 | 3.700000e+00 | 2.070000e+01 | 221 | 1718 | 25164 | 8391 | 1 | 37.9638009 |
| Length.max.slab | 26882 | 1.265407e+03 | 5.041442e+01 | 5.280000e+02 | 1.312000e+03 | 30 | 4034 | 22848 | 9384 | 3 | 312.7000000 |
| lTileID | 26882 | 2.433464e+02 | 1.369660e+02 | 1.000000e+00 | 5.100000e+02 | 511 | 7 | 26875 | 113 | 1 | 0.2191781 |
| MAT_IDENT | 26882 | 2.985188e+07 | 3.937609e+05 | 2.917029e+07 | 3.077100e+07 | 657 | 0 | 26882 | 321 | 1 | 0.4870624 |
| NETTO_PFANNENINHALT | 26882 | 1.648394e+02 | 7.692643e+01 | 0.000000e+00 | 4.040000e+02 | 12751 | 1718 | 25164 | 14 | 1 | 0.0010195 |
| PLATTENDICKE_SSL | 26882 | 4.829999e+01 | 2.019073e+00 | 4.320000e+01 | 5.000000e+01 | 40 | 1718 | 25164 | 8017 | 15 | 200.0500000 |
| PLATTENDICKE_SSR | 26882 | 4.745850e+01 | 2.089287e+00 | 4.329000e+01 | 5.000000e+01 | 35 | 1718 | 25164 | 5988 | 3 | 171.0000000 |
| POSITION_X.x | 26882 | 5.979768e+02 | 3.447785e+02 | 5.175509e+00 | 1.841143e+03 | 25133 | 1718 | 25164 | 2 | 1 | 0.0000398 |
| POSITION_X.y | 26882 | 5.420162e+02 | 3.192708e+02 | 3.703906e+00 | 1.289000e+03 | 17499 | 4034 | 22848 | 72 | 1 | 0.0040574 |
| PR_40__FB__IR__S | 26882 | 1.800993e+00 | 7.283150e-01 | 0.000000e+00 | 8.111780e+00 | 15842 | 10513 | 16369 | 529 | 1 | 0.0333291 |
| RIEGELLAENGE | 26882 | 5.140679e+00 | 2.881748e+00 | 5.400000e-02 | 1.158800e+01 | 17836 | 1718 | 25164 | 12 | 1 | 0.0006167 |
| RISS__HA_AS__IR__S | 26882 | 3.270000e-05 | 3.355500e-03 | 0.000000e+00 | 4.107143e-01 | 4 | 10513 | 16369 | 16367 | 1 | 4091.5000000 |
| RISS__HA_BS__IR__S | 26882 | 1.509000e-04 | 8.996600e-03 | 0.000000e+00 | 7.083333e-01 | 9 | 10513 | 16369 | 16362 | 1 | 1817.8888889 |
| STOPFENSTELLUNG | 26882 | 5.489166e+01 | 5.534024e+00 | 4.300000e+01 | 7.000000e+01 | 262 | 1718 | 25164 | 2270 | 1 | 8.6603053 |
| STRANGBREITE | 26882 | 2.490284e+03 | 1.190497e+02 | 2.151000e+03 | 2.577000e+03 | 118 | 1718 | 25164 | 9198 | 1 | 77.9406780 |
| STRANGNUMMER | 26882 | 1.460777e+00 | 4.984691e-01 | 1.000000e+00 | 2.000000e+00 | 3 | 1718 | 25164 | 13569 | 11595 | 658.0000000 |
| TEMP__FB__IR__S | 26882 | 4.497819e+01 | 1.231432e+01 | 0.000000e+00 | 1.956925e+02 | 16109 | 10513 | 16369 | 262 | 1 | 0.0162021 |
| TEMP__FB_1__IR__S | 26882 | 4.488583e+01 | 1.228837e+01 | 0.000000e+00 | 1.951373e+02 | 16109 | 10513 | 16369 | 262 | 1 | 0.0162021 |
| TEMP__FB_2__IR__S | 26882 | 4.501117e+01 | 1.232453e+01 | 0.000000e+00 | 1.956925e+02 | 16109 | 10513 | 16369 | 262 | 1 | 0.0162021 |
| TEMP__FB_3__IR__S | 26882 | 4.485969e+01 | 1.367100e+01 | 0.000000e+00 | 4.520970e+02 | 16111 | 10513 | 16369 | 260 | 1 | 0.0160760 |
| TEMP__HA__IR__S | 26882 | 3.136023e+01 | 8.657163e+00 | 0.000000e+00 | 1.549556e+02 | 16109 | 10513 | 16369 | 262 | 1 | 0.0162021 |
| TEMP__HA__OS__IR__S | 26882 | 3.127105e+01 | 8.688989e+00 | 0.000000e+00 | 1.550227e+02 | 16112 | 10513 | 16369 | 259 | 1 | 0.0160129 |
| TEMP__HA__SR__MAX | 26882 | 3.561846e+01 | 4.902349e+01 | 0.000000e+00 | 6.400000e+02 | 127 | 10513 | 16369 | 831 | 1 | 6.5354331 |
| TEMP__HA__SR__MIN | 26882 | 3.339235e+01 | 4.595951e+01 | 0.000000e+00 | 6.000000e+02 | 128 | 10513 | 16369 | 831 | 1 | 6.4843750 |
| TEMP__HA__SR__S | 26882 | 3.450541e+01 | 4.749150e+01 | 0.000000e+00 | 6.200000e+02 | 127 | 10513 | 16369 | 831 | 1 | 6.5354331 |
| TEMP__HA_1__IR__S | 26882 | 3.137055e+01 | 8.701447e+00 | 0.000000e+00 | 1.549556e+02 | 16112 | 10513 | 16369 | 259 | 1 | 0.0160129 |
| TEMP__HA_2__IR__S | 26882 | 3.127105e+01 | 8.688989e+00 | 0.000000e+00 | 1.550227e+02 | 16112 | 10513 | 16369 | 259 | 1 | 0.0160129 |
| TEMP__HA_4__IR__S | 26882 | 2.891099e+01 | 7.921209e+00 | 0.000000e+00 | 1.079578e+02 | 16118 | 10513 | 16369 | 253 | 1 | 0.0156347 |
| TEMP__HA_5__IR__S | 26882 | 3.086202e+01 | 8.576830e+00 | 0.000000e+00 | 1.451340e+02 | 16099 | 10513 | 16369 | 272 | 1 | 0.0168333 |
| TEMP__VB__IR__S | 26882 | 1.603233e+00 | 8.225923e+00 | 0.000000e+00 | 7.206003e+01 | 632 | 10513 | 16369 | 15739 | 1 | 24.9018987 |
| TEMP__VB_1__IR__S | 26882 | 1.602526e+00 | 8.220138e+00 | 0.000000e+00 | 7.206003e+01 | 632 | 10513 | 16369 | 15739 | 1 | 24.9018987 |
| TEMP__VB_2__IR__S | 26882 | 1.601913e+00 | 8.216825e+00 | 0.000000e+00 | 7.172220e+01 | 632 | 10513 | 16369 | 15739 | 1 | 24.9018987 |
| TEMP__VB_3__IR__S | 26882 | 1.610312e+00 | 8.259578e+00 | 0.000000e+00 | 7.185537e+01 | 632 | 10513 | 16369 | 15739 | 1 | 24.9018987 |
| TEMP__VB_4__IR__S | 26882 | 1.611833e+00 | 8.267230e+00 | 0.000000e+00 | 7.199982e+01 | 632 | 10513 | 16369 | 15739 | 1 | 24.9018987 |
| TEMP__VB_5__IR__S | 26882 | 1.699921e+00 | 8.578367e+00 | 0.000000e+00 | 7.287423e+01 | 651 | 10513 | 16369 | 15720 | 1 | 24.1459293 |
| TM_FS_M | 26882 | 1.288398e+02 | 1.178362e+01 | 9.320000e+01 | 1.566000e+02 | 5450 | 1718 | 25164 | 62 | 1 | 0.0111927 |
| TM_FS_SSL | 26882 | 1.277399e+02 | 1.056593e+01 | 9.333333e+01 | 1.597000e+02 | 5226 | 1718 | 25164 | 56 | 1 | 0.0105243 |
| TM_FS_SSR | 26882 | 1.285015e+02 | 1.120327e+01 | 9.915000e+01 | 1.614500e+02 | 5206 | 1718 | 25164 | 71 | 1 | 0.0134460 |
| TM_LS_M | 26882 | 1.265216e+02 | 1.003511e+01 | 8.960000e+01 | 1.510000e+02 | 4937 | 1718 | 25164 | 73 | 1 | 0.0145838 |
| TM_LS_SSL | 26882 | 1.318702e+02 | 1.032792e+01 | 1.014000e+02 | 1.628667e+02 | 4988 | 1718 | 25164 | 75 | 1 | 0.0148356 |
| TM_LS_SSR | 26882 | 1.321628e+02 | 1.085231e+01 | 9.990000e+01 | 1.691000e+02 | 5321 | 1718 | 25164 | 73 | 1 | 0.0135313 |
| TM_SSL_FS | 26882 | 1.453342e+02 | 1.446186e+01 | 1.041750e+02 | 1.791000e+02 | 5156 | 1718 | 25164 | 77 | 1 | 0.0147401 |
| TM_SSL_LS | 26882 | 1.329567e+02 | 1.021378e+01 | 1.073667e+02 | 1.645000e+02 | 4517 | 1718 | 25164 | 109 | 1 | 0.0239097 |
| TM_SSR_FS | 26882 | 1.376102e+02 | 1.083885e+01 | 1.031500e+02 | 1.767000e+02 | 4857 | 1718 | 25164 | 79 | 1 | 0.0160593 |
| TM_SSR_LS | 26882 | 1.301028e+02 | 8.983733e+00 | 1.016667e+02 | 1.598000e+02 | 4312 | 1718 | 25164 | 94 | 1 | 0.0215677 |
| TO_FS_M | 26882 | 1.846690e+02 | 1.613626e+01 | 1.339333e+02 | 2.244500e+02 | 6683 | 1718 | 25164 | 47 | 1 | 0.0068831 |
| TO_FS_SSL | 26882 | 1.885192e+02 | 2.793312e+01 | 8.000000e-01 | 7.977000e+02 | 6835 | 2620 | 24262 | 51 | 1 | 0.0073153 |
| TO_FS_SSR | 26882 | 1.913317e+02 | 1.379441e+01 | 1.441750e+02 | 2.262500e+02 | 6299 | 1718 | 25164 | 58 | 1 | 0.0090491 |
| TO_LS_M | 26882 | 1.863830e+02 | 1.294765e+01 | 1.388000e+02 | 2.216500e+02 | 5884 | 1718 | 25164 | 69 | 1 | 0.0115568 |
| TO_LS_SSL | 26882 | 1.961198e+02 | 1.250285e+01 | 1.511500e+02 | 2.289000e+02 | 5864 | 1718 | 25164 | 63 | 1 | 0.0105730 |
| TO_LS_SSR | 26882 | 1.949970e+02 | 1.264377e+01 | 1.455000e+02 | 2.248000e+02 | 5991 | 1718 | 25164 | 65 | 1 | 0.0106827 |
| TO_SSL_FS | 26882 | 2.052935e+02 | 1.576007e+01 | 1.476000e+02 | 2.410000e+02 | 6657 | 1718 | 25164 | 53 | 1 | 0.0078113 |
| TO_SSL_LS | 26882 | 2.015116e+02 | 1.541129e+01 | 1.503667e+02 | 2.377000e+02 | 6521 | 1718 | 25164 | 50 | 1 | 0.0075142 |
| TO_SSR_FS | 26882 | 2.030546e+02 | 1.587309e+01 | 1.448667e+02 | 2.412000e+02 | 6750 | 1718 | 25164 | 57 | 1 | 0.0082963 |
| TO_SSR_LS | 26882 | 1.983856e+02 | 1.474387e+01 | 1.408250e+02 | 2.367500e+02 | 6523 | 1718 | 25164 | 56 | 1 | 0.0084317 |
| TU_FS_M | 26882 | 1.116696e+02 | 7.413646e+00 | 8.015000e+01 | 1.304500e+02 | 4243 | 2298 | 24584 | 92 | 1 | 0.0214471 |
| TU_FS_SSL | 26882 | 1.087951e+02 | 7.716555e+00 | 8.368000e+01 | 1.342000e+02 | 4368 | 1718 | 25164 | 91 | 1 | 0.0206044 |
| TU_FS_SSR | 26882 | 1.089810e+02 | 7.231841e+00 | 8.084000e+01 | 1.348500e+02 | 4249 | 1718 | 25164 | 126 | 1 | 0.0294187 |
| TU_LS_M | 26882 | 1.086232e+02 | 6.924098e+00 | 6.540000e+01 | 1.294500e+02 | 4090 | 1959 | 24923 | 100 | 1 | 0.0242054 |
| TU_LS_SSL | 26882 | 1.152493e+02 | 7.128589e+00 | 8.065000e+01 | 1.382500e+02 | 4182 | 2736 | 24146 | 105 | 1 | 0.0248685 |
| TU_LS_SSR | 26882 | 1.150699e+02 | 7.318471e+00 | 8.833333e+01 | 1.394000e+02 | 4300 | 1959 | 24923 | 83 | 1 | 0.0190698 |
| TU_SSL_FS | 26882 | 1.257709e+02 | 9.163527e+00 | 8.836667e+01 | 1.524000e+02 | 4481 | 1718 | 25164 | 76 | 1 | 0.0167373 |
| TU_SSL_LS | 26882 | 1.153944e+02 | 7.976129e+00 | 9.183333e+01 | 1.436000e+02 | 4297 | 1718 | 25164 | 94 | 1 | 0.0216430 |
| TU_SSR_FS | 26882 | 1.221326e+02 | 8.008096e+00 | 8.710000e+01 | 1.467000e+02 | 4318 | 1718 | 25164 | 86 | 1 | 0.0196850 |
| TU_SSR_LS | 26882 | 1.141591e+02 | 7.792329e+00 | 8.290000e+01 | 1.485000e+02 | 4290 | 1718 | 25164 | 117 | 1 | 0.0270396 |
| TUNDISH_POSITION | 26882 | 1.271549e+01 | 1.156004e+01 | 0.000000e+00 | 4.200000e+01 | 32 | 1718 | 25164 | 4115 | 1 | 128.5625000 |
| V__FS_G1__IR__S | 26882 | 3.216581e-01 | 1.135128e+00 | 0.000000e+00 | 8.171358e+00 | 1296 | 10513 | 16369 | 15075 | 1 | 11.6311728 |
| V__FS_G2__IR__S | 26882 | 9.235807e-01 | 2.339188e+00 | 0.000000e+00 | 1.247841e+01 | 2356 | 10513 | 16369 | 14015 | 1 | 5.9482173 |
| V__FS_G3__IR__S | 26882 | 2.603245e+00 | 4.623162e+00 | 0.000000e+00 | 2.758184e+01 | 4205 | 10513 | 16369 | 12166 | 1 | 2.8929845 |
| V__FS_G4__IR__S | 26882 | 6.294015e+00 | 7.883289e+00 | 0.000000e+00 | 4.301559e+01 | 6762 | 10513 | 16369 | 9609 | 1 | 1.4208814 |
| V__FS_G5__IR__S | 26882 | 1.486168e+01 | 1.202346e+01 | 0.000000e+00 | 7.304973e+01 | 10559 | 10513 | 16369 | 5812 | 1 | 0.5503362 |
| V__FS_G6__IR__S | 26882 | 2.531957e+01 | 1.339378e+01 | 0.000000e+00 | 9.872687e+01 | 13621 | 10513 | 16369 | 2750 | 1 | 0.2018207 |
| V__FS_G7__IR__S | 26882 | 3.673649e+01 | 1.082059e+01 | 0.000000e+00 | 1.611927e+02 | 16100 | 10513 | 16369 | 271 | 1 | 0.0167702 |
| VERTEILERFUELLSTAND | 26882 | 7.882090e+01 | 1.623281e+00 | 6.132000e+01 | 8.246667e+01 | 844 | 1718 | 25164 | 1457 | 1 | 1.7251185 |
| VG | 26882 | 9.885491e-01 | 9.614080e-02 | 7.610000e-01 | 1.157000e+00 | 819 | 1718 | 25164 | 2199 | 1 | 2.6837607 |
| VORBRAMME | 26882 | 2.670925e+02 | 2.481209e+02 | 2.200000e+01 | 5.530000e+02 | 22 | 0 | 26882 | 3016 | 9 | 136.6818182 |
| WASSER_FS | 26882 | 2.661484e+00 | 1.759780e-02 | 2.563600e+00 | 2.720500e+00 | 1765 | 1718 | 25164 | 529 | 1 | 0.2991501 |
| WASSER_LS | 26882 | 2.669746e+00 | 1.256150e-02 | 2.595750e+00 | 2.719500e+00 | 1399 | 1718 | 25164 | 854 | 1 | 0.6097212 |
| WASSER_SSL | 26882 | 2.626474e-01 | 8.790700e-03 | 2.450000e-01 | 2.782000e-01 | 352 | 1718 | 25164 | 2249 | 1 | 6.3863636 |
| WASSER_SSR | 26882 | 2.572780e-01 | 1.135910e-02 | 2.370000e-01 | 2.810000e-01 | 535 | 1718 | 25164 | 1587 | 1 | 2.9644860 |
| WK__FS_G1__IR__S | 26882 | 4.442741e+01 | 1.564459e+02 | 0.000000e+00 | 1.077098e+03 | 1296 | 10513 | 16369 | 15075 | 1 | 11.6311728 |
| WK__FS_G2__IR__S | 26882 | 8.591459e+01 | 2.168795e+02 | 0.000000e+00 | 1.202033e+03 | 2356 | 10513 | 16369 | 14015 | 1 | 5.9482173 |
| WK__FS_G3__IR__S | 26882 | 1.582005e+02 | 2.797363e+02 | 0.000000e+00 | 1.679224e+03 | 4205 | 10513 | 16369 | 12166 | 1 | 2.8929845 |
| WK__FS_G4__IR__S | 26882 | 2.469891e+02 | 3.074766e+02 | 0.000000e+00 | 1.692017e+03 | 6762 | 10513 | 16369 | 9609 | 1 | 1.4208814 |
| WK__FS_G5__IR__S | 26882 | 3.512747e+02 | 2.814103e+02 | 0.000000e+00 | 1.725439e+03 | 10559 | 10513 | 16369 | 5812 | 1 | 0.5503362 |
| WK__FS_G6__IR__S | 26882 | 3.571932e+02 | 1.870432e+02 | 0.000000e+00 | 1.296698e+03 | 13621 | 10513 | 16369 | 2750 | 1 | 0.2018207 |
| WK__FS_G7__IR__S | 26882 | 3.018373e+02 | 8.826540e+01 | 0.000000e+00 | 1.437853e+03 | 16169 | 10513 | 16369 | 202 | 1 | 0.0124312 |
| WK__VS_HG_3__IR__S | 26882 | 2.081314e+01 | 1.078005e+02 | 0.000000e+00 | 9.421366e+02 | 621 | 10513 | 16369 | 15750 | 1 | 25.3607085 |
| WK__VS_SP_3__IR__S | 26882 | 6.795180e-02 | 5.596544e-01 | -3.138046e-01 | 1.225219e+01 | 396 | 10513 | 16369 | 15964 | 1 | 40.3106061 |
| WSPALT__FS_G1__IR__S | 26882 | 7.454170e-02 | 2.619365e-01 | 0.000000e+00 | 1.830767e+00 | 1296 | 10513 | 16369 | 15075 | 1 | 11.6311728 |
| WSPALT__FS_G2__IR__S | 26882 | 9.341970e-02 | 2.353754e-01 | 0.000000e+00 | 1.252877e+00 | 2356 | 10513 | 16369 | 14015 | 1 | 5.9482173 |
| WSPALT__FS_G3__IR__S | 26882 | 1.157892e-01 | 2.043523e-01 | 0.000000e+00 | 1.208468e+00 | 4211 | 10513 | 16369 | 12160 | 1 | 2.8874377 |
| WSPALT__FS_G4__IR__S | 26882 | 1.295598e-01 | 1.611171e-01 | 0.000000e+00 | 8.493480e-01 | 6780 | 10513 | 16369 | 9591 | 1 | 1.4144543 |
| WSPALT__FS_G5__IR__S | 26882 | 1.390093e-01 | 1.109823e-01 | 0.000000e+00 | 6.135501e-01 | 10604 | 10513 | 16369 | 5767 | 1 | 0.5437571 |
| WSPALT__FS_G6__IR__S | 26882 | 1.355501e-01 | 7.061980e-02 | 0.000000e+00 | 5.206122e-01 | 13687 | 10513 | 16369 | 2684 | 1 | 0.1960254 |
| WSPALT__FS_G7__IR__S | 26882 | 1.647887e-01 | 4.480130e-02 | 0.000000e+00 | 6.212929e-01 | 16169 | 10513 | 16369 | 202 | 1 | 0.0124312 |
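The code producing this table is not shown. The sketch below, in base R, reproduces the same columns on toy data. The helper `describe_num` and the toy values are assumptions. The formula used for freq_ratio, (max_freq - min_freq) / unique, is inferred from the rows above (for example, (15750 - 1) / 621 ≈ 25.36 for ANST__VS_HG_3__IR__S) and is not confirmed by the original code.

```r
# Descriptive statistics for one numeric vector
describe_num <- function(x) {
  freq <- table(x)                 # value frequencies; table() drops NA
  data.frame(
    count    = length(x),
    mean     = mean(x, na.rm = TRUE),
    sd       = sd(x, na.rm = TRUE),
    min      = min(x, na.rm = TRUE),
    max      = max(x, na.rm = TRUE),
    unique   = length(unique(x)),  # assumption: NA counted as a value
    na       = sum(is.na(x)),
    N        = sum(!is.na(x)),
    max_freq = max(freq),          # frequency of the most common value
    min_freq = min(freq)           # frequency of the rarest value
  )
}

toy_num <- data.frame(
  x = c(0, 0, 0, 0, NA, NA),       # constant apart from missing values
  y = c(1, 1, 1, 2, 3, NA)
)

toy_num_desc <- do.call(rbind, lapply(toy_num, describe_num))
# Inferred definition of freq_ratio (an assumption, see text)
toy_num_desc$freq_ratio <- with(toy_num_desc, (max_freq - min_freq) / unique)
toy_num_desc <- cbind(slab_attr = rownames(toy_num_desc), toy_num_desc)
rownames(toy_num_desc) <- NULL
print(toy_num_desc)
```

The toy column `x` behaves like a variable that is constant wherever it is observed: its standard deviation is 0 while its unique count is 2.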
Again, we are looking for variables without variation and for any interesting patterns within the data. Above, a variable was considered constant if it had only one unique value; here, variables are instead checked for a standard deviation of 0.
#Dropping variables with no variation
var.sd0 <- df_num_desc %>%
  dplyr::filter(sd==0) %>%
  dplyr::select(slab_attr) %>%
  unlist() %>%
  as.character()
df_num_long <- df_num_long %>%
  dplyr::filter(!slab_attr %in% var.sd0)
df_num <- df_num %>%
  dplyr::select(-all_of(var.sd0))
From the table above we see that only one variable has a standard deviation of zero: "ENTZ__ZWR1_EL_SN3__IR__S". Its unique count of 2 appears to reflect one observed value (0) plus the missing entries, which is why the earlier check on unique values did not flag it. It is the only variable selected and dropped in the above code chunk. We now calculate the pairwise correlation of all our numeric variables.
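The chunk that computes the correlation matrix is hidden in the report. Assuming a Pearson correlation on pairwise-complete observations, a matrix like the cormat object used later could be obtained in base R roughly as follows. The toy data and the name `cormat_toy` are illustrative, and the original chunk may well have used a different package.

```r
set.seed(1)
toy_num <- data.frame(a = rnorm(20))
toy_num$b <- 2 * toy_num$a + 1   # perfectly correlated with a
toy_num$c <- rnorm(20)
toy_num$c[1:5] <- NA             # missing values in one column only

# Each pair of columns uses the rows where both values are present
cormat_toy <- cor(toy_num, use = "pairwise.complete.obs", method = "pearson")
print(round(cormat_toy, 3))
```

With pairwise-complete observations, different cells of the matrix can be based on different numbers of rows, which is worth keeping in mind when variables have very different missing-value counts.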
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
[Figure: pairwise correlation plot of the numeric variables]
The plot above shows the pairwise correlation of all variables, with positive correlation marked in green and negative correlation marked in red. Additionally, variables are arranged such that groups of correlated variables are listed together. For all pairs which are perfectly correlated, i.e. having a correlation of 1 or -1, only one member of the pair is kept in the final data set. When two variables are perfectly correlated, the information they provide in describing the variability of other variables is redundant, so the latter member of each pair is dropped.
#For all pairs which are perfectly correlated (|r| = 1), we keep only one
corONE <- function(x) {
  if (!is.matrix(x)) {
    stop("no matrix!")
  }
  #row/column positions of all entries with absolute correlation 1
  cor1.df <- data.frame(which(abs(x) == 1, arr.ind = TRUE))
  setDT(cor1.df, keep.rownames = TRUE)
  #below the diagonal only: each pair once, listing the later variable
  cor1.list <- cor1.df$rn[cor1.df$row > cor1.df$col]
  #drop entries whose name contains a dot (e.g. suffixed duplicate row names)
  grx <- glob2rx("*.*")
  duplicate.list <- grepl(grx, cor1.list, perl = TRUE)
  cor1.list[!duplicate.list]
}
#list of all perfectly correlated variables
cor1.list <- corONE(cormat)
write.table(cor1.list, file="anna_Length_ListofVariableswithCor1.txt", sep="\t")
#Drop corresponding columns
df_num <- df_num %>%
  dplyr::select(-all_of(cor1.list))
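As a point of comparison, the same selection can be sketched in base R by looking only at the lower triangle of the correlation matrix. This is an alternative formulation, not the method used in the report. The function `find_perfect_cor`, its tolerance argument `tol`, and the toy matrix are hypothetical.

```r
# Return the later member of every perfectly correlated pair
find_perfect_cor <- function(cormat, tol = 1e-12) {
  stopifnot(is.matrix(cormat))
  hit <- abs(abs(cormat) - 1) < tol           # |r| equal to 1 within tolerance
  hit[is.na(hit)] <- FALSE                    # undefined correlations are not hits
  hit[upper.tri(hit, diag = TRUE)] <- FALSE   # each pair once, diagonal skipped
  idx <- which(hit, arr.ind = TRUE)           # here row index > column index
  unique(rownames(cormat)[idx[, "row"]])
}

# Toy correlation matrix: A, B.x and D are perfectly (anti-)correlated
vars <- c("A", "B.x", "C", "D")
cormat_toy <- matrix(
  c( 1,    1,    0.2, -1,
     1,    1,    0.2, -1,
     0.2,  0.2,  1,   -0.2,
    -1,   -1,   -0.2,  1),
  nrow = 4, byrow = TRUE,
  dimnames = list(vars, vars)
)
print(find_perfect_cor(cormat_toy))
```

Because this version works from matrix indices instead of row names, variable names that contain a dot (such as Class.14 in the data above) are not filtered out, and the tolerance guards against correlations that differ from 1 only by floating-point error.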
| slab_attr | count | mean | sd | min | max | unique | na | N |
|---|---|---|---|---|---|---|---|---|
| ANST__VS_HG_3__IR__S | 26882 | 4.564310e-02 | 2.360949e-01 | 0.000000e+00 | 2.048611e+00 | 621 | 10513 | 16369 |
| ANST__VS_SP_3__IR__S | 26882 | 1.861879e+00 | 9.709546e+00 | 0.000000e+00 | 8.524173e+01 | 543 | 10513 | 16369 |
| ARGON_DRUCK_ST | 26882 | 5.986625e+01 | 1.870098e+01 | 1.800000e+01 | 1.000000e+02 | 426 | 1718 | 25164 |
| ARGON_DURCHFL_DUSCH | 26882 | 1.152141e+02 | 3.499762e+01 | 5.200000e-01 | 1.900000e+02 | 4033 | 1718 | 25164 |
| ARGON_DURCHFL_ST | 26882 | 8.326040e+00 | 5.982072e-01 | 3.410000e+00 | 1.108000e+01 | 1050 | 1718 | 25164 |
| CHARGEN_NR | 26882 | 4.490450e+05 | 2.574589e+05 | 1.631710e+05 | 7.226910e+05 | 249 | 0 | 26882 |
| Class.14 | 26882 | 2.823450e-02 | 4.637674e-01 | 0.000000e+00 | 3.700000e+01 | 17 | 0 | 26882 |
| Class.15 | 26882 | 2.652330e-02 | 4.595173e-01 | 0.000000e+00 | 2.800000e+01 | 20 | 0 | 26882 |
| Class.4 | 26882 | 4.746671e-01 | 1.265641e+00 | 0.000000e+00 | 3.500000e+01 | 27 | 0 | 26882 |
| CoilID | 26882 | 1.914100e+07 | 3.678003e+05 | 1.865870e+07 | 2.002710e+07 | 657 | 0 | 26882 |
| DICKE__AL__IR__S | 26882 | 1.511021e-01 | 4.236360e-02 | 0.000000e+00 | 6.130659e-01 | 16028 | 10513 | 16369 |
| DICKE__HA_1__IR__S | 26882 | 1.516607e-01 | 4.182080e-02 | 0.000000e+00 | 6.130659e-01 | 16086 | 10513 | 16369 |
| DICKE__HA_2__IR__S | 26882 | 1.459174e-01 | 4.722860e-02 | 0.000000e+00 | 4.192426e-01 | 15569 | 10513 | 16369 |
| DICKE__VB__IR__S | 26882 | 4.809820e-02 | 2.543457e-01 | 0.000000e+00 | 2.269721e+00 | 596 | 10513 | 16369 |
| DT_FS | 26882 | 6.856955e+01 | 4.827426e+00 | 5.070000e+01 | 8.510000e+01 | 2260 | 1718 | 25164 |
| DT_LS | 26882 | 7.040780e+01 | 5.088746e+00 | 5.050000e+01 | 8.580000e+01 | 2415 | 1718 | 25164 |
| DT_SSL | 26882 | 5.877077e+01 | 4.419236e+00 | 4.358000e+01 | 6.970000e+01 | 2175 | 1718 | 25164 |
| DT_SSR | 26882 | 5.354157e+01 | 3.564115e+00 | 4.070000e+01 | 6.600000e+01 | 2055 | 1718 | 25164 |
| ENTZ__FS_ZW_F1__IR__S | 26882 | 3.392100e-03 | 1.191530e-02 | 0.000000e+00 | 8.333330e-02 | 89 | 10513 | 16369 |
| ENTZ__FS_ZW_F2__IR__S | 26882 | 5.956500e-03 | 1.562020e-02 | 0.000000e+00 | 8.333330e-02 | 96 | 10513 | 16369 |
| ENTZ__ZW_OF_AL__IR__S | 26882 | 3.000000e-06 | 2.196000e-04 | 0.000000e+00 | 1.639340e-02 | 4 | 10513 | 16369 |
| ENTZ__ZW2_AL__IR__S | 26882 | 1.527500e-03 | 7.791700e-03 | 0.000000e+00 | 6.666670e-02 | 61 | 10513 | 16369 |
| ENTZ__ZWR1_AL_SN2__IR__S | 26882 | 1.111000e-04 | 2.261600e-03 | 0.000000e+00 | 6.666670e-02 | 23 | 10513 | 16369 |
| ENTZ__ZWR1_EL_SN1__IR__S | 26882 | 1.006000e-04 | 2.112400e-03 | 0.000000e+00 | 6.666670e-02 | 25 | 10513 | 16369 |
| FUELLSTAND | 26882 | 7.497974e+01 | 6.737898e-01 | 7.200000e+01 | 7.850000e+01 | 97 | 1718 | 25164 |
| KEIL25__FB__IR__S | 26882 | -2.246960e-02 | 8.339448e-01 | -4.823868e+00 | 3.480015e+00 | 15190 | 10513 | 16369 |
| KEIL40__FB__IR__S | 26882 | -1.268257e-01 | 7.662683e-01 | -4.087126e+00 | 3.306016e+00 | 15755 | 10513 | 16369 |
| KEIL50__FB__IR__S | 26882 | -4.375930e-02 | 8.002416e-01 | -4.844529e+00 | 3.428814e+00 | 15190 | 10513 | 16369 |
| KONI_LINKS | 26882 | 1.119920e+01 | 6.437812e-01 | 3.700000e+00 | 1.867500e+01 | 204 | 1718 | 25164 |
| KONI_RECHTS | 26882 | 1.120665e+01 | 6.602210e-01 | 3.700000e+00 | 2.070000e+01 | 221 | 1718 | 25164 |
| Length.max.slab | 26882 | 1.265407e+03 | 5.041442e+01 | 5.280000e+02 | 1.312000e+03 | 30 | 4034 | 22848 |
| lTileID | 26882 | 2.433464e+02 | 1.369660e+02 | 1.000000e+00 | 5.100000e+02 | 511 | 7 | 26875 |
| MAT_IDENT | 26882 | 2.985188e+07 | 3.937609e+05 | 2.917029e+07 | 3.077100e+07 | 657 | 0 | 26882 |
| NETTO_PFANNENINHALT | 26882 | 1.648394e+02 | 7.692643e+01 | 0.000000e+00 | 4.040000e+02 | 12751 | 1718 | 25164 |
| PLATTENDICKE_SSL | 26882 | 4.829999e+01 | 2.019073e+00 | 4.320000e+01 | 5.000000e+01 | 40 | 1718 | 25164 |
| PLATTENDICKE_SSR | 26882 | 4.745850e+01 | 2.089287e+00 | 4.329000e+01 | 5.000000e+01 | 35 | 1718 | 25164 |
| POSITION_X.x | 26882 | 5.979768e+02 | 3.447785e+02 | 5.175509e+00 | 1.841143e+03 | 25133 | 1718 | 25164 |
| POSITION_X.y | 26882 | 5.420162e+02 | 3.192708e+02 | 3.703906e+00 | 1.289000e+03 | 17499 | 4034 | 22848 |
| PR_40__FB__IR__S | 26882 | 1.800993e+00 | 7.283150e-01 | 0.000000e+00 | 8.111780e+00 | 15842 | 10513 | 16369 |
| RIEGELLAENGE | 26882 | 5.140679e+00 | 2.881748e+00 | 5.400000e-02 | 1.158800e+01 | 17836 | 1718 | 25164 |
| RISS__HA_AS__IR__S | 26882 | 3.270000e-05 | 3.355500e-03 | 0.000000e+00 | 4.107143e-01 | 4 | 10513 | 16369 |
| RISS__HA_BS__IR__S | 26882 | 1.509000e-04 | 8.996600e-03 | 0.000000e+00 | 7.083333e-01 | 9 | 10513 | 16369 |
| STOPFENSTELLUNG | 26882 | 5.489166e+01 | 5.534024e+00 | 4.300000e+01 | 7.000000e+01 | 262 | 1718 | 25164 |
| STRANGBREITE | 26882 | 2.490284e+03 | 1.190497e+02 | 2.151000e+03 | 2.577000e+03 | 118 | 1718 | 25164 |
| STRANGNUMMER | 26882 | 1.460777e+00 | 4.984691e-01 | 1.000000e+00 | 2.000000e+00 | 3 | 1718 | 25164 |
| TEMP__FB__IR__S | 26882 | 4.497819e+01 | 1.231432e+01 | 0.000000e+00 | 1.956925e+02 | 16109 | 10513 | 16369 |
| TEMP__FB_1__IR__S | 26882 | 4.488583e+01 | 1.228837e+01 | 0.000000e+00 | 1.951373e+02 | 16109 | 10513 | 16369 |
| TEMP__FB_2__IR__S | 26882 | 4.501117e+01 | 1.232453e+01 | 0.000000e+00 | 1.956925e+02 | 16109 | 10513 | 16369 |
| TEMP__FB_3__IR__S | 26882 | 4.485969e+01 | 1.367100e+01 | 0.000000e+00 | 4.520970e+02 | 16111 | 10513 | 16369 |
| TEMP__HA__IR__S | 26882 | 3.136023e+01 | 8.657163e+00 | 0.000000e+00 | 1.549556e+02 | 16109 | 10513 | 16369 |
| TEMP__HA__SR__MAX | 26882 | 3.561846e+01 | 4.902349e+01 | 0.000000e+00 | 6.400000e+02 | 127 | 10513 | 16369 |
| TEMP__HA_1__IR__S | 26882 | 3.137055e+01 | 8.701447e+00 | 0.000000e+00 | 1.549556e+02 | 16112 | 10513 | 16369 |
| TEMP__HA_2__IR__S | 26882 | 3.127105e+01 | 8.688989e+00 | 0.000000e+00 | 1.550227e+02 | 16112 | 10513 | 16369 |
| TEMP__HA_4__IR__S | 26882 | 2.891099e+01 | 7.921209e+00 | 0.000000e+00 | 1.079578e+02 | 16118 | 10513 | 16369 |
| TEMP__HA_5__IR__S | 26882 | 3.086202e+01 | 8.576830e+00 | 0.000000e+00 | 1.451340e+02 | 16099 | 10513 | 16369 |
| TEMP__VB__IR__S | 26882 | 1.603233e+00 | 8.225923e+00 | 0.000000e+00 | 7.206003e+01 | 632 | 10513 | 16369 |
| TEMP__VB_1__IR__S | 26882 | 1.602526e+00 | 8.220138e+00 | 0.000000e+00 | 7.206003e+01 | 632 | 10513 | 16369 |
| TEMP__VB_5__IR__S | 26882 | 1.699921e+00 | 8.578367e+00 | 0.000000e+00 | 7.287423e+01 | 651 | 10513 | 16369 |
| TM_FS_M | 26882 | 1.288398e+02 | 1.178362e+01 | 9.320000e+01 | 1.566000e+02 | 5450 | 1718 | 25164 |
| TM_FS_SSL | 26882 | 1.277399e+02 | 1.056593e+01 | 9.333333e+01 | 1.597000e+02 | 5226 | 1718 | 25164 |
| TM_FS_SSR | 26882 | 1.285015e+02 | 1.120327e+01 | 9.915000e+01 | 1.614500e+02 | 5206 | 1718 | 25164 |
| TM_LS_M | 26882 | 1.265216e+02 | 1.003511e+01 | 8.960000e+01 | 1.510000e+02 | 4937 | 1718 | 25164 |
| TM_LS_SSL | 26882 | 1.318702e+02 | 1.032792e+01 | 1.014000e+02 | 1.628667e+02 | 4988 | 1718 | 25164 |
| TM_LS_SSR | 26882 | 1.321628e+02 | 1.085231e+01 | 9.990000e+01 | 1.691000e+02 | 5321 | 1718 | 25164 |
| TM_SSL_FS | 26882 | 1.453342e+02 | 1.446186e+01 | 1.041750e+02 | 1.791000e+02 | 5156 | 1718 | 25164 |
| TM_SSL_LS | 26882 | 1.329567e+02 | 1.021378e+01 | 1.073667e+02 | 1.645000e+02 | 4517 | 1718 | 25164 |
| TM_SSR_FS | 26882 | 1.376102e+02 | 1.083885e+01 | 1.031500e+02 | 1.767000e+02 | 4857 | 1718 | 25164 |
| TM_SSR_LS | 26882 | 1.301028e+02 | 8.983733e+00 | 1.016667e+02 | 1.598000e+02 | 4312 | 1718 | 25164 |
| TO_FS_M | 26882 | 1.846690e+02 | 1.613626e+01 | 1.339333e+02 | 2.244500e+02 | 6683 | 1718 | 25164 |
| TO_FS_SSL | 26882 | 1.885192e+02 | 2.793312e+01 | 8.000000e-01 | 7.977000e+02 | 6835 | 2620 | 24262 |
| TO_FS_SSR | 26882 | 1.913317e+02 | 1.379441e+01 | 1.441750e+02 | 2.262500e+02 | 6299 | 1718 | 25164 |
| TO_LS_M | 26882 | 1.863830e+02 | 1.294765e+01 | 1.388000e+02 | 2.216500e+02 | 5884 | 1718 | 25164 |
| TO_LS_SSL | 26882 | 1.961198e+02 | 1.250285e+01 | 1.511500e+02 | 2.289000e+02 | 5864 | 1718 | 25164 |
| TO_LS_SSR | 26882 | 1.949970e+02 | 1.264377e+01 | 1.455000e+02 | 2.248000e+02 | 5991 | 1718 | 25164 |
| TO_SSL_FS | 26882 | 2.052935e+02 | 1.576007e+01 | 1.476000e+02 | 2.410000e+02 | 6657 | 1718 | 25164 |
| TO_SSL_LS | 26882 | 2.015116e+02 | 1.541129e+01 | 1.503667e+02 | 2.377000e+02 | 6521 | 1718 | 25164 |
| TO_SSR_FS | 26882 | 2.030546e+02 | 1.587309e+01 | 1.448667e+02 | 2.412000e+02 | 6750 | 1718 | 25164 |
| TO_SSR_LS | 26882 | 1.983856e+02 | 1.474387e+01 | 1.408250e+02 | 2.367500e+02 | 6523 | 1718 | 25164 |
| TU_FS_M | 26882 | 1.116696e+02 | 7.413646e+00 | 8.015000e+01 | 1.304500e+02 | 4243 | 2298 | 24584 |
| TU_FS_SSL | 26882 | 1.087951e+02 | 7.716555e+00 | 8.368000e+01 | 1.342000e+02 | 4368 | 1718 | 25164 |
| TU_FS_SSR | 26882 | 1.089810e+02 | 7.231841e+00 | 8.084000e+01 | 1.348500e+02 | 4249 | 1718 | 25164 |
| TU_LS_M | 26882 | 1.086232e+02 | 6.924098e+00 | 6.540000e+01 | 1.294500e+02 | 4090 | 1959 | 24923 |
| TU_LS_SSL | 26882 | 1.152493e+02 | 7.128589e+00 | 8.065000e+01 | 1.382500e+02 | 4182 | 2736 | 24146 |
| TU_LS_SSR | 26882 | 1.150699e+02 | 7.318471e+00 | 8.833333e+01 | 1.394000e+02 | 4300 | 1959 | 24923 |
| TU_SSL_FS | 26882 | 1.257709e+02 | 9.163527e+00 | 8.836667e+01 | 1.524000e+02 | 4481 | 1718 | 25164 |
| TU_SSL_LS | 26882 | 1.153944e+02 | 7.976129e+00 | 9.183333e+01 | 1.436000e+02 | 4297 | 1718 | 25164 |
| TU_SSR_FS | 26882 | 1.221326e+02 | 8.008096e+00 | 8.710000e+01 | 1.467000e+02 | 4318 | 1718 | 25164 |
| TU_SSR_LS | 26882 | 1.141591e+02 | 7.792329e+00 | 8.290000e+01 | 1.485000e+02 | 4290 | 1718 | 25164 |
| TUNDISH_POSITION | 26882 | 1.271549e+01 | 1.156004e+01 | 0.000000e+00 | 4.200000e+01 | 32 | 1718 | 25164 |
| V__FS_G1__IR__S | 26882 | 3.216581e-01 | 1.135128e+00 | 0.000000e+00 | 8.171358e+00 | 1296 | 10513 | 16369 |
| V__FS_G2__IR__S | 26882 | 9.235807e-01 | 2.339188e+00 | 0.000000e+00 | 1.247841e+01 | 2356 | 10513 | 16369 |
| V__FS_G3__IR__S | 26882 | 2.603245e+00 | 4.623162e+00 | 0.000000e+00 | 2.758184e+01 | 4205 | 10513 | 16369 |
| V__FS_G4__IR__S | 26882 | 6.294015e+00 | 7.883289e+00 | 0.000000e+00 | 4.301559e+01 | 6762 | 10513 | 16369 |
| V__FS_G5__IR__S | 26882 | 1.486168e+01 | 1.202346e+01 | 0.000000e+00 | 7.304973e+01 | 10559 | 10513 | 16369 |
| V__FS_G6__IR__S | 26882 | 2.531957e+01 | 1.339378e+01 | 0.000000e+00 | 9.872687e+01 | 13621 | 10513 | 16369 |
| V__FS_G7__IR__S | 26882 | 3.673649e+01 | 1.082059e+01 | 0.000000e+00 | 1.611927e+02 | 16100 | 10513 | 16369 |
| VERTEILERFUELLSTAND | 26882 | 7.882090e+01 | 1.623281e+00 | 6.132000e+01 | 8.246667e+01 | 844 | 1718 | 25164 |
| VG | 26882 | 9.885491e-01 | 9.614080e-02 | 7.610000e-01 | 1.157000e+00 | 819 | 1718 | 25164 |
| VORBRAMME | 26882 | 2.670925e+02 | 2.481209e+02 | 2.200000e+01 | 5.530000e+02 | 22 | 0 | 26882 |
| WASSER_FS | 26882 | 2.661484e+00 | 1.759780e-02 | 2.563600e+00 | 2.720500e+00 | 1765 | 1718 | 25164 |
| WASSER_LS | 26882 | 2.669746e+00 | 1.256150e-02 | 2.595750e+00 | 2.719500e+00 | 1399 | 1718 | 25164 |
| WASSER_SSL | 26882 | 2.626474e-01 | 8.790700e-03 | 2.450000e-01 | 2.782000e-01 | 352 | 1718 | 25164 |
| WASSER_SSR | 26882 | 2.572780e-01 | 1.135910e-02 | 2.370000e-01 | 2.810000e-01 | 535 | 1718 | 25164 |
| WK__FS_G1__IR__S | 26882 | 4.442741e+01 | 1.564459e+02 | 0.000000e+00 | 1.077098e+03 | 1296 | 10513 | 16369 |
| WK__FS_G2__IR__S | 26882 | 8.591459e+01 | 2.168795e+02 | 0.000000e+00 | 1.202033e+03 | 2356 | 10513 | 16369 |
| WK__FS_G3__IR__S | 26882 | 1.582005e+02 | 2.797363e+02 | 0.000000e+00 | 1.679224e+03 | 4205 | 10513 | 16369 |
| WK__FS_G4__IR__S | 26882 | 2.469891e+02 | 3.074766e+02 | 0.000000e+00 | 1.692017e+03 | 6762 | 10513 | 16369 |
| WK__FS_G5__IR__S | 26882 | 3.512747e+02 | 2.814103e+02 | 0.000000e+00 | 1.725439e+03 | 10559 | 10513 | 16369 |
| WK__FS_G6__IR__S | 26882 | 3.571932e+02 | 1.870432e+02 | 0.000000e+00 | 1.296698e+03 | 13621 | 10513 | 16369 |
| WK__FS_G7__IR__S | 26882 | 3.018373e+02 | 8.826540e+01 | 0.000000e+00 | 1.437853e+03 | 16169 | 10513 | 16369 |
| WK__VS_HG_3__IR__S | 26882 | 2.081314e+01 | 1.078005e+02 | 0.000000e+00 | 9.421366e+02 | 621 | 10513 | 16369 |
| WK__VS_SP_3__IR__S | 26882 | 6.795180e-02 | 5.596544e-01 | -3.138046e-01 | 1.225219e+01 | 396 | 10513 | 16369 |
| WSPALT__FS_G1__IR__S | 26882 | 7.454170e-02 | 2.619365e-01 | 0.000000e+00 | 1.830767e+00 | 1296 | 10513 | 16369 |
| WSPALT__FS_G2__IR__S | 26882 | 9.341970e-02 | 2.353754e-01 | 0.000000e+00 | 1.252877e+00 | 2356 | 10513 | 16369 |
| WSPALT__FS_G3__IR__S | 26882 | 1.157892e-01 | 2.043523e-01 | 0.000000e+00 | 1.208468e+00 | 4211 | 10513 | 16369 |
| WSPALT__FS_G4__IR__S | 26882 | 1.295598e-01 | 1.611171e-01 | 0.000000e+00 | 8.493480e-01 | 6780 | 10513 | 16369 |
| WSPALT__FS_G5__IR__S | 26882 | 1.390093e-01 | 1.109823e-01 | 0.000000e+00 | 6.135501e-01 | 10604 | 10513 | 16369 |
| WSPALT__FS_G6__IR__S | 26882 | 1.355501e-01 | 7.061980e-02 | 0.000000e+00 | 5.206122e-01 | 13687 | 10513 | 16369 |
| WSPALT__FS_G7__IR__S | 26882 | 1.647887e-01 | 4.480130e-02 | 0.000000e+00 | 6.212929e-01 | 16169 | 10513 | 16369 |
Since these descriptive statistics were last computed, 9 variables have been dropped. The data set nevertheless still contains many variables with high pairwise correlations, so the pairwise correlation is recomputed on the reduced data set and those pairs with absolute correlation greater than .95 are selected. Within such a pair, a linear fit of one variable on the other explains roughly 90% or more of its variance (r² > .95² ≈ .90). Because these variables are still so highly correlated, again only the first variable of each pair is kept in the data set for further analysis.
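The selection step described above can be sketched on toy data. This is a minimal illustration only: the data frame `toy` and its columns are hypothetical, and keeping the first variable of each pair is assumed to mirror the rule stated in the text.

```r
# Toy data (hypothetical): x2 is nearly a copy of x1, x3 is independent noise
set.seed(1)
toy <- data.frame(x1 = rnorm(200))
toy$x2 <- toy$x1 + rnorm(200, sd = 0.1)
toy$x3 <- rnorm(200)
toy$x2[1:20] <- NA  # some missing values, as in the production data

# Pearson correlation using all pairwise-complete observations
cm <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# Look only at the upper triangle so that each pair is reported once
high <- which(abs(cm) > 0.95 & upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  keep = rownames(cm)[high[, "row"]],  # first variable of the pair
  drop = colnames(cm)[high[, "col"]],  # second variable of the pair
  cor  = cm[high]
)
print(pairs)
```

Restricting the search to one triangle of the matrix removes both the diagonal and the mirrored copy of every pair.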
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## Don't know how to automatically pick scale for object of type noquote. Defaulting to continuous.
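#Return all variable pairs with |correlation| > eps and the variables to drop
filter.cor <- function(x, eps) {
  if (is.matrix(x)) {
    #(row, col) positions of all entries above the threshold
    idx <- which(abs(x) > eps, arr.ind = TRUE)
    cor.df <- data.frame(idx)
    setDT(cor.df, keep.rownames = TRUE)
    cor.df$cor <- x[idx]
    #lower triangle only: removes the diagonal and the mirrored pairs
    cor.df <- cor.df[cor.df$row > cor.df$col]
    cor.df$cn <- colnames(x)[cor.df$col]
    #keep each variable once: repeated row names carry a ".<n>" suffix
    #(names that themselves contain a dot are filtered out as well)
    cor.list <- cor.df$rn
    grx <- glob2rx("*.*")
    duplicate.list <- grepl(grx, cor.list, perl = TRUE)
    cor.list <- cor.list[!duplicate.list]
    #strip the suffix so that rn holds the plain variable name
    cor.df$rn <- sub(pattern = "(.*)\\..*$", replacement = "\\1", cor.df$rn)
    corList <- list(CorMat = cor.df, cor.list = cor.list)
    return(corList)
  } else {
    stop("no matrix!")
  }
}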
df_numList <- filter.cor(cormat, eps=0.95)
df_num2 <- df_numList$CorMat
cor.list <- df_numList$cor.list
cor.list <- cor.list[!cor.list %in% c("CoilID")]
cor.list <- c(cor.list, "POSITION_X.y")
length(cor.list)
## [1] 29
#List of all variables with abs. corr > 0.95
write.table(cor.list, file="anna_length_ListofVariableswithCorLT095.txt", sep="\t")
The list of variables with pairwise absolute correlation greater than .95 contains 29 variables. Before dropping such a large number of variables from the data set, the summary table below is reviewed for any interesting patterns.
| rn | count | runique | unique | cor_var | na | N |
|---|---|---|---|---|---|---|
| ANST__VS_SP_3__IR__S | 1 | 1 | 1 | ANST__VS_HG_3__IR__S | 0 | 1 |
| CoilID | 1 | 1 | 1 | MAT_IDENT | 0 | 1 |
| ENTZ__ZW2_AL__IR__S | 2 | 1 | 2 | ANST__VS_HG_3__IR__S, ANST__VS_SP_3__IR__S | 0 | 2 |
| ENTZ__ZWR1_EL_SN1__IR__S | 1 | 1 | 1 | ENTZ__ZWR1_AL_SN2__IR__S | 0 | 1 |
| KEIL50__FB__IR__S | 1 | 1 | 1 | KEIL25__FB__IR__S | 0 | 1 |
| KONI_RECHTS | 1 | 1 | 1 | KONI_LINKS | 0 | 1 |
| POSITION_X | 2 | 1 | 1 | lTileID, lTileID | 0 | 2 |
| POSITION_X.y | 2 | 1 | 2 | POSITION_X.x, RIEGELLAENGE | 0 | 2 |
| RIEGELLAENGE | 2 | 1 | 2 | lTileID, POSITION_X.x | 0 | 2 |
| STRANGNUMMER | 1 | 1 | 1 | VORBRAMME | 0 | 1 |
| TEMP__FB__IR__S | 2 | 1 | 2 | TEMP__FB_1__IR__S, TEMP__FB_2__IR__S | 0 | 2 |
| TEMP__FB_2__IR__S | 1 | 1 | 1 | TEMP__FB_1__IR__S | 0 | 1 |
| TEMP__HA__IR__S | 2 | 1 | 2 | TEMP__HA_1__IR__S, TEMP__HA_2__IR__S | 0 | 2 |
| TEMP__HA_2__IR__S | 1 | 1 | 1 | TEMP__HA_1__IR__S | 0 | 1 |
| TEMP__VB__IR__S | 5 | 1 | 5 | ANST__VS_HG_3__IR__S, ANST__VS_SP_3__IR__S, ENTZ__ZW2_AL__IR__S, TEMP__VB_1__IR__S, TEMP__VB_5__IR__S | 0 | 5 |
| TEMP__VB_1__IR__S | 3 | 1 | 3 | ANST__VS_HG_3__IR__S, ANST__VS_SP_3__IR__S, ENTZ__ZW2_AL__IR__S | 0 | 3 |
| TEMP__VB_5__IR__S | 2 | 1 | 2 | ENTZ__ZW2_AL__IR__S, TEMP__VB_1__IR__S | 0 | 2 |
| V__FS_G1__IR__S | 1 | 1 | 1 | ENTZ__FS_ZW_F1__IR__S | 0 | 1 |
| WK__FS_G1__IR__S | 2 | 1 | 2 | ENTZ__FS_ZW_F1__IR__S, V__FS_G1__IR__S | 0 | 2 |
| WK__FS_G2__IR__S | 1 | 1 | 1 | V__FS_G2__IR__S | 0 | 1 |
| WK__FS_G3__IR__S | 1 | 1 | 1 | V__FS_G3__IR__S | 0 | 1 |
| WK__FS_G4__IR__S | 1 | 1 | 1 | V__FS_G4__IR__S | 0 | 1 |
| WK__FS_G5__IR__S | 1 | 1 | 1 | V__FS_G5__IR__S | 0 | 1 |
| WK__FS_G6__IR__S | 1 | 1 | 1 | V__FS_G6__IR__S | 0 | 1 |
| WK__VS_HG_3__IR__S | 5 | 1 | 5 | ANST__VS_HG_3__IR__S, ANST__VS_SP_3__IR__S, ENTZ__ZW2_AL__IR__S, TEMP__VB_1__IR__S, TEMP__VB__IR__S | 0 | 5 |
| WSPALT__FS_G1__IR__S | 3 | 1 | 3 | ENTZ__FS_ZW_F1__IR__S, V__FS_G1__IR__S, WK__FS_G1__IR__S | 0 | 3 |
| WSPALT__FS_G2__IR__S | 2 | 1 | 2 | V__FS_G2__IR__S, WK__FS_G2__IR__S | 0 | 2 |
| WSPALT__FS_G3__IR__S | 2 | 1 | 2 | V__FS_G3__IR__S, WK__FS_G3__IR__S | 0 | 2 |
| WSPALT__FS_G4__IR__S | 2 | 1 | 2 | V__FS_G4__IR__S, WK__FS_G4__IR__S | 0 | 2 |
| WSPALT__FS_G5__IR__S | 2 | 1 | 2 | V__FS_G5__IR__S, WK__FS_G5__IR__S | 0 | 2 |
| WSPALT__FS_G6__IR__S | 1 | 1 | 1 | V__FS_G6__IR__S | 0 | 1 |
As the table above shows nothing immediately worrying, all variables listed in cor.list are dropped from the data set.
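A per-variable overview of this kind can be built from a table of pairs roughly as follows. This is a sketch under assumptions: the `pairs` data frame is made up, with `rn` standing for the variable to be dropped and `cn` for its correlated partner, and the code that produced the table in this report is not shown.

```r
# Hypothetical pairs table: 'rn' is the variable to drop, 'cn' its partner
pairs <- data.frame(
  rn = c("b", "c", "c", "d"),
  cn = c("a", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# One row per dropped variable: number of partners and their names
partners <- split(pairs$cn, pairs$rn)
overview <- data.frame(
  rn      = names(partners),
  count   = lengths(partners),
  cor_var = vapply(partners, paste, character(1), collapse = ", "),
  row.names = NULL
)
print(overview)

# Partners that are never dropped themselves, i.e. the variables that
# remain in the data set to represent each correlated group
survivors <- setdiff(pairs$cn, pairs$rn)
print(survivors)
```

Checking `survivors` before dropping anything is one way to confirm that every correlated group keeps at least one representative.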
#dropping corresponding columns
df_num <- df_num %>%
  dplyr::select(-dplyr::all_of(cor.list))
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## Don't know how to automatically pick scale for object of type noquote. Defaulting to continuous.
The differences between the original correlation matrix and the matrix built on the reduced data set are now clearly visible. The remaining variables are still highly correlated in places, but the correlations are less extreme overall. When comparing the above correlation matrix with the previous matrices, note that each individual matrix is ordered so that highly correlated variables are listed together; the order of the variables therefore changes from matrix to matrix as variables are dropped.
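The plotting code is not shown here, so the exact ordering method is unknown. As an assumption, the sketch below uses hierarchical clustering on the dissimilarity 1 - |r|, a common way to place highly correlated variables next to each other; the toy variables are hypothetical.

```r
# Toy data (hypothetical): a1/a2 and b1/b2 form two correlated groups
set.seed(2)
a <- rnorm(100)
b <- rnorm(100)
toy <- data.frame(a1 = a, b1 = b,
                  a2 = a + rnorm(100, sd = 0.2),
                  b2 = b + rnorm(100, sd = 0.2))
cm <- cor(toy)

# Cluster on the dissimilarity 1 - |r| and reorder rows and columns
ord <- hclust(as.dist(1 - abs(cm)))$order
cm_ordered <- cm[ord, ord]
print(round(cm_ordered, 2))
```

Because the ordering is derived from the correlations themselves, removing variables changes the clustering and with it the position of the remaining variables.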
Having dropped a notable number of variables, the descriptive statistics are computed one final time for the numeric variables and are reviewed for interesting patterns.
| slab_attr | count | mean | sd | min | max | unique | na | N |
|---|---|---|---|---|---|---|---|---|
| ANST__VS_HG_3__IR__S | 26882 | 4.564310e-02 | 2.360949e-01 | 0.000000e+00 | 2.048611e+00 | 621 | 10513 | 16369 |
| ARGON_DRUCK_ST | 26882 | 5.986625e+01 | 1.870098e+01 | 1.800000e+01 | 1.000000e+02 | 426 | 1718 | 25164 |
| ARGON_DURCHFL_DUSCH | 26882 | 1.152141e+02 | 3.499762e+01 | 5.200000e-01 | 1.900000e+02 | 4033 | 1718 | 25164 |
| ARGON_DURCHFL_ST | 26882 | 8.326040e+00 | 5.982072e-01 | 3.410000e+00 | 1.108000e+01 | 1050 | 1718 | 25164 |
| CHARGEN_NR | 26882 | 4.490450e+05 | 2.574589e+05 | 1.631710e+05 | 7.226910e+05 | 249 | 0 | 26882 |
| Class.14 | 26882 | 2.823450e-02 | 4.637674e-01 | 0.000000e+00 | 3.700000e+01 | 17 | 0 | 26882 |
| Class.15 | 26882 | 2.652330e-02 | 4.595173e-01 | 0.000000e+00 | 2.800000e+01 | 20 | 0 | 26882 |
| Class.4 | 26882 | 4.746671e-01 | 1.265641e+00 | 0.000000e+00 | 3.500000e+01 | 27 | 0 | 26882 |
| CoilID | 26882 | 1.914100e+07 | 3.678003e+05 | 1.865870e+07 | 2.002710e+07 | 657 | 0 | 26882 |
| DICKE__AL__IR__S | 26882 | 1.511021e-01 | 4.236360e-02 | 0.000000e+00 | 6.130659e-01 | 16028 | 10513 | 16369 |
| DICKE__HA_1__IR__S | 26882 | 1.516607e-01 | 4.182080e-02 | 0.000000e+00 | 6.130659e-01 | 16086 | 10513 | 16369 |
| DICKE__HA_2__IR__S | 26882 | 1.459174e-01 | 4.722860e-02 | 0.000000e+00 | 4.192426e-01 | 15569 | 10513 | 16369 |
| DICKE__VB__IR__S | 26882 | 4.809820e-02 | 2.543457e-01 | 0.000000e+00 | 2.269721e+00 | 596 | 10513 | 16369 |
| DT_FS | 26882 | 6.856955e+01 | 4.827426e+00 | 5.070000e+01 | 8.510000e+01 | 2260 | 1718 | 25164 |
| DT_LS | 26882 | 7.040780e+01 | 5.088746e+00 | 5.050000e+01 | 8.580000e+01 | 2415 | 1718 | 25164 |
| DT_SSL | 26882 | 5.877077e+01 | 4.419236e+00 | 4.358000e+01 | 6.970000e+01 | 2175 | 1718 | 25164 |
| DT_SSR | 26882 | 5.354157e+01 | 3.564115e+00 | 4.070000e+01 | 6.600000e+01 | 2055 | 1718 | 25164 |
| ENTZ__FS_ZW_F1__IR__S | 26882 | 3.392100e-03 | 1.191530e-02 | 0.000000e+00 | 8.333330e-02 | 89 | 10513 | 16369 |
| ENTZ__FS_ZW_F2__IR__S | 26882 | 5.956500e-03 | 1.562020e-02 | 0.000000e+00 | 8.333330e-02 | 96 | 10513 | 16369 |
| ENTZ__ZW_OF_AL__IR__S | 26882 | 3.000000e-06 | 2.196000e-04 | 0.000000e+00 | 1.639340e-02 | 4 | 10513 | 16369 |
| ENTZ__ZWR1_AL_SN2__IR__S | 26882 | 1.111000e-04 | 2.261600e-03 | 0.000000e+00 | 6.666670e-02 | 23 | 10513 | 16369 |
| FUELLSTAND | 26882 | 7.497974e+01 | 6.737898e-01 | 7.200000e+01 | 7.850000e+01 | 97 | 1718 | 25164 |
| KEIL25__FB__IR__S | 26882 | -2.246960e-02 | 8.339448e-01 | -4.823868e+00 | 3.480015e+00 | 15190 | 10513 | 16369 |
| KEIL40__FB__IR__S | 26882 | -1.268257e-01 | 7.662683e-01 | -4.087126e+00 | 3.306016e+00 | 15755 | 10513 | 16369 |
| KONI_LINKS | 26882 | 1.119920e+01 | 6.437812e-01 | 3.700000e+00 | 1.867500e+01 | 204 | 1718 | 25164 |
| Length.max.slab | 26882 | 1.265407e+03 | 5.041442e+01 | 5.280000e+02 | 1.312000e+03 | 30 | 4034 | 22848 |
| lTileID | 26882 | 2.433464e+02 | 1.369660e+02 | 1.000000e+00 | 5.100000e+02 | 511 | 7 | 26875 |
| MAT_IDENT | 26882 | 2.985188e+07 | 3.937609e+05 | 2.917029e+07 | 3.077100e+07 | 657 | 0 | 26882 |
| NETTO_PFANNENINHALT | 26882 | 1.648394e+02 | 7.692643e+01 | 0.000000e+00 | 4.040000e+02 | 12751 | 1718 | 25164 |
| PLATTENDICKE_SSL | 26882 | 4.829999e+01 | 2.019073e+00 | 4.320000e+01 | 5.000000e+01 | 40 | 1718 | 25164 |
| PLATTENDICKE_SSR | 26882 | 4.745850e+01 | 2.089287e+00 | 4.329000e+01 | 5.000000e+01 | 35 | 1718 | 25164 |
| POSITION_X.x | 26882 | 5.979768e+02 | 3.447785e+02 | 5.175509e+00 | 1.841143e+03 | 25133 | 1718 | 25164 |
| PR_40__FB__IR__S | 26882 | 1.800993e+00 | 7.283150e-01 | 0.000000e+00 | 8.111780e+00 | 15842 | 10513 | 16369 |
| RISS__HA_AS__IR__S | 26882 | 3.270000e-05 | 3.355500e-03 | 0.000000e+00 | 4.107143e-01 | 4 | 10513 | 16369 |
| RISS__HA_BS__IR__S | 26882 | 1.509000e-04 | 8.996600e-03 | 0.000000e+00 | 7.083333e-01 | 9 | 10513 | 16369 |
| STOPFENSTELLUNG | 26882 | 5.489166e+01 | 5.534024e+00 | 4.300000e+01 | 7.000000e+01 | 262 | 1718 | 25164 |
| STRANGBREITE | 26882 | 2.490284e+03 | 1.190497e+02 | 2.151000e+03 | 2.577000e+03 | 118 | 1718 | 25164 |
| TEMP__FB_1__IR__S | 26882 | 4.488583e+01 | 1.228837e+01 | 0.000000e+00 | 1.951373e+02 | 16109 | 10513 | 16369 |
| TEMP__FB_3__IR__S | 26882 | 4.485969e+01 | 1.367100e+01 | 0.000000e+00 | 4.520970e+02 | 16111 | 10513 | 16369 |
| TEMP__HA__SR__MAX | 26882 | 3.561846e+01 | 4.902349e+01 | 0.000000e+00 | 6.400000e+02 | 127 | 10513 | 16369 |
| TEMP__HA_1__IR__S | 26882 | 3.137055e+01 | 8.701447e+00 | 0.000000e+00 | 1.549556e+02 | 16112 | 10513 | 16369 |
| TEMP__HA_4__IR__S | 26882 | 2.891099e+01 | 7.921209e+00 | 0.000000e+00 | 1.079578e+02 | 16118 | 10513 | 16369 |
| TEMP__HA_5__IR__S | 26882 | 3.086202e+01 | 8.576830e+00 | 0.000000e+00 | 1.451340e+02 | 16099 | 10513 | 16369 |
| TM_FS_M | 26882 | 1.288398e+02 | 1.178362e+01 | 9.320000e+01 | 1.566000e+02 | 5450 | 1718 | 25164 |
| TM_FS_SSL | 26882 | 1.277399e+02 | 1.056593e+01 | 9.333333e+01 | 1.597000e+02 | 5226 | 1718 | 25164 |
| TM_FS_SSR | 26882 | 1.285015e+02 | 1.120327e+01 | 9.915000e+01 | 1.614500e+02 | 5206 | 1718 | 25164 |
| TM_LS_M | 26882 | 1.265216e+02 | 1.003511e+01 | 8.960000e+01 | 1.510000e+02 | 4937 | 1718 | 25164 |
| TM_LS_SSL | 26882 | 1.318702e+02 | 1.032792e+01 | 1.014000e+02 | 1.628667e+02 | 4988 | 1718 | 25164 |
| TM_LS_SSR | 26882 | 1.321628e+02 | 1.085231e+01 | 9.990000e+01 | 1.691000e+02 | 5321 | 1718 | 25164 |
| TM_SSL_FS | 26882 | 1.453342e+02 | 1.446186e+01 | 1.041750e+02 | 1.791000e+02 | 5156 | 1718 | 25164 |
| TM_SSL_LS | 26882 | 1.329567e+02 | 1.021378e+01 | 1.073667e+02 | 1.645000e+02 | 4517 | 1718 | 25164 |
| TM_SSR_FS | 26882 | 1.376102e+02 | 1.083885e+01 | 1.031500e+02 | 1.767000e+02 | 4857 | 1718 | 25164 |
| TM_SSR_LS | 26882 | 1.301028e+02 | 8.983733e+00 | 1.016667e+02 | 1.598000e+02 | 4312 | 1718 | 25164 |
| TO_FS_M | 26882 | 1.846690e+02 | 1.613626e+01 | 1.339333e+02 | 2.244500e+02 | 6683 | 1718 | 25164 |
| TO_FS_SSL | 26882 | 1.885192e+02 | 2.793312e+01 | 8.000000e-01 | 7.977000e+02 | 6835 | 2620 | 24262 |
| TO_FS_SSR | 26882 | 1.913317e+02 | 1.379441e+01 | 1.441750e+02 | 2.262500e+02 | 6299 | 1718 | 25164 |
| TO_LS_M | 26882 | 1.863830e+02 | 1.294765e+01 | 1.388000e+02 | 2.216500e+02 | 5884 | 1718 | 25164 |
| TO_LS_SSL | 26882 | 1.961198e+02 | 1.250285e+01 | 1.511500e+02 | 2.289000e+02 | 5864 | 1718 | 25164 |
| TO_LS_SSR | 26882 | 1.949970e+02 | 1.264377e+01 | 1.455000e+02 | 2.248000e+02 | 5991 | 1718 | 25164 |
| TO_SSL_FS | 26882 | 2.052935e+02 | 1.576007e+01 | 1.476000e+02 | 2.410000e+02 | 6657 | 1718 | 25164 |
| TO_SSL_LS | 26882 | 2.015116e+02 | 1.541129e+01 | 1.503667e+02 | 2.377000e+02 | 6521 | 1718 | 25164 |
| TO_SSR_FS | 26882 | 2.030546e+02 | 1.587309e+01 | 1.448667e+02 | 2.412000e+02 | 6750 | 1718 | 25164 |
| TO_SSR_LS | 26882 | 1.983856e+02 | 1.474387e+01 | 1.408250e+02 | 2.367500e+02 | 6523 | 1718 | 25164 |
| TU_FS_M | 26882 | 1.116696e+02 | 7.413646e+00 | 8.015000e+01 | 1.304500e+02 | 4243 | 2298 | 24584 |
| TU_FS_SSL | 26882 | 1.087951e+02 | 7.716555e+00 | 8.368000e+01 | 1.342000e+02 | 4368 | 1718 | 25164 |
| TU_FS_SSR | 26882 | 1.089810e+02 | 7.231841e+00 | 8.084000e+01 | 1.348500e+02 | 4249 | 1718 | 25164 |
| TU_LS_M | 26882 | 1.086232e+02 | 6.924098e+00 | 6.540000e+01 | 1.294500e+02 | 4090 | 1959 | 24923 |
| TU_LS_SSL | 26882 | 1.152493e+02 | 7.128589e+00 | 8.065000e+01 | 1.382500e+02 | 4182 | 2736 | 24146 |
| TU_LS_SSR | 26882 | 1.150699e+02 | 7.318471e+00 | 8.833333e+01 | 1.394000e+02 | 4300 | 1959 | 24923 |
| TU_SSL_FS | 26882 | 1.257709e+02 | 9.163527e+00 | 8.836667e+01 | 1.524000e+02 | 4481 | 1718 | 25164 |
| TU_SSL_LS | 26882 | 1.153944e+02 | 7.976129e+00 | 9.183333e+01 | 1.436000e+02 | 4297 | 1718 | 25164 |
| TU_SSR_FS | 26882 | 1.221326e+02 | 8.008096e+00 | 8.710000e+01 | 1.467000e+02 | 4318 | 1718 | 25164 |
| TU_SSR_LS | 26882 | 1.141591e+02 | 7.792329e+00 | 8.290000e+01 | 1.485000e+02 | 4290 | 1718 | 25164 |
| TUNDISH_POSITION | 26882 | 1.271549e+01 | 1.156004e+01 | 0.000000e+00 | 4.200000e+01 | 32 | 1718 | 25164 |
| V__FS_G2__IR__S | 26882 | 9.235807e-01 | 2.339188e+00 | 0.000000e+00 | 1.247841e+01 | 2356 | 10513 | 16369 |
| V__FS_G3__IR__S | 26882 | 2.603245e+00 | 4.623162e+00 | 0.000000e+00 | 2.758184e+01 | 4205 | 10513 | 16369 |
| V__FS_G4__IR__S | 26882 | 6.294015e+00 | 7.883289e+00 | 0.000000e+00 | 4.301559e+01 | 6762 | 10513 | 16369 |
| V__FS_G5__IR__S | 26882 | 1.486168e+01 | 1.202346e+01 | 0.000000e+00 | 7.304973e+01 | 10559 | 10513 | 16369 |
| V__FS_G6__IR__S | 26882 | 2.531957e+01 | 1.339378e+01 | 0.000000e+00 | 9.872687e+01 | 13621 | 10513 | 16369 |
| V__FS_G7__IR__S | 26882 | 3.673649e+01 | 1.082059e+01 | 0.000000e+00 | 1.611927e+02 | 16100 | 10513 | 16369 |
| VERTEILERFUELLSTAND | 26882 | 7.882090e+01 | 1.623281e+00 | 6.132000e+01 | 8.246667e+01 | 844 | 1718 | 25164 |
| VG | 26882 | 9.885491e-01 | 9.614080e-02 | 7.610000e-01 | 1.157000e+00 | 819 | 1718 | 25164 |
| VORBRAMME | 26882 | 2.670925e+02 | 2.481209e+02 | 2.200000e+01 | 5.530000e+02 | 22 | 0 | 26882 |
| WASSER_FS | 26882 | 2.661484e+00 | 1.759780e-02 | 2.563600e+00 | 2.720500e+00 | 1765 | 1718 | 25164 |
| WASSER_LS | 26882 | 2.669746e+00 | 1.256150e-02 | 2.595750e+00 | 2.719500e+00 | 1399 | 1718 | 25164 |
| WASSER_SSL | 26882 | 2.626474e-01 | 8.790700e-03 | 2.450000e-01 | 2.782000e-01 | 352 | 1718 | 25164 |
| WASSER_SSR | 26882 | 2.572780e-01 | 1.135910e-02 | 2.370000e-01 | 2.810000e-01 | 535 | 1718 | 25164 |
| WK__FS_G7__IR__S | 26882 | 3.018373e+02 | 8.826540e+01 | 0.000000e+00 | 1.437853e+03 | 16169 | 10513 | 16369 |
| WK__VS_SP_3__IR__S | 26882 | 6.795180e-02 | 5.596544e-01 | -3.138046e-01 | 1.225219e+01 | 396 | 10513 | 16369 |
| WSPALT__FS_G7__IR__S | 26882 | 1.647887e-01 | 4.480130e-02 | 0.000000e+00 | 6.212929e-01 | 16169 | 10513 | 16369 |
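The columns of the table above could be computed along the following lines. The helper `describe_numeric` is a hypothetical name, not the function used for this report, and counting NA as one distinct value in `unique` is an assumption inferred from the table (for example, STRANGNUMMER ranges from 1 to 2 but shows 3 unique values).

```r
# Per-column descriptive statistics for a data frame of numeric variables
describe_numeric <- function(df) {
  stats <- lapply(df, function(x) {
    data.frame(
      count  = length(x),               # all rows, including missing
      mean   = mean(x, na.rm = TRUE),
      sd     = sd(x, na.rm = TRUE),
      min    = min(x, na.rm = TRUE),
      max    = max(x, na.rm = TRUE),
      unique = length(unique(x)),       # NA counts as one distinct value
      na     = sum(is.na(x)),           # missing values
      N      = sum(!is.na(x))           # non-missing values
    )
  })
  out <- do.call(rbind, stats)
  data.frame(slab_attr = names(df), out, row.names = NULL)
}

toy <- data.frame(u = c(1, 2, 2, NA), v = c(10, 20, 30, 40))
desc <- describe_numeric(toy)
print(desc)
```

By construction `na + N` equals `count` in every row, which gives a quick consistency check on the table.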
The working data set has now been reduced to 90 variables, all non-constant. Having reduced the numeric variables as much as seems required, these changes may be applied to the main data frame, df.
#Reduce accordingly
df <- df %>%
  dplyr::select(-dplyr::all_of(cor1.list), -dplyr::all_of(cor.list), -dplyr::all_of(var.sd0)) %>%
  dplyr::rename(
    POSITION_X = POSITION_X.x
  )
#Dropping redundancies and piece related vars
df <- df %>%
dplyr::select(-CHARGEN_NR, -VORBRAMME, -Length.max.slab)
Our main data frame now has 88 variables and 26882 observations. With no additional motivation to reduce the data set further, the error rates can now be modeled, as a function of all remaining production variables, in order to determine which variables seem to have the most significant impact on the correct classification of a surface error.
As the distributions of the error rates are highly skewed, with a large share of zero values among the many observations, a log transform is first applied to (data + 1). This adjusts for the skew while avoiding log(0), which is not finite.
#Log transform of error counts
df1 <- df%>%
dplyr::mutate(lnClass.4 = log(Class.4 + 1),
lnClass.14 = log(Class.14 + 1),
lnClass.15 = log(Class.15 + 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The log transformation is applied to the data above, and the histograms show the spread of the data before and after the transform. The transform has its greatest effect on the Class 4 errors and causes no notable change in the spread of the other two error classes. As such, the log transformation will only be used for the Class 4 errors.
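The behaviour of the log(x + 1) transform on zero-heavy counts can be checked with a small sketch. The count vector is made up for illustration; `log1p` and `expm1` are the base R equivalents of log(x + 1) and exp(y) - 1.

```r
counts <- c(0, 0, 0, 1, 2, 35)     # zero-heavy toy counts (hypothetical)
ln_counts <- log1p(counts)         # same as log(counts + 1)

all.equal(ln_counts, log(counts + 1))
ln_counts[counts == 0]             # zeros stay at zero
back <- expm1(ln_counts)           # inverse transform: exp(y) - 1
all.equal(back, counts)
```

Since zeros map to zero, the transform compresses the long right tail but cannot remove the spike at zero itself.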
linmodlnC4.pred <- lm(lnClass.4~.- CoilID - MAT_IDENT - lTileID - Class.4 - Class.14 - Class.15 - lnClass.14 - lnClass.15, data =df1)
##
## Call:
## lm(formula = lnClass.4 ~ . - CoilID - MAT_IDENT - lTileID - Class.4 -
## Class.14 - Class.15 - lnClass.14 - lnClass.15, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81393 -0.25878 -0.20246 -0.09486 3.02660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.045e+00 1.386e+00 2.197 0.028004 *
## POSITION_X -2.254e-05 4.187e-05 -0.538 0.590412
## VORG_HAUPTAGGREGATBRSG01 -1.344e-02 1.092e-02 -1.231 0.218241
## TO_FS_SSL 9.799e-05 1.542e-04 0.635 0.525236
## TO_FS_M -1.340e-03 4.123e-04 -3.250 0.001158 **
## TO_FS_SSR -7.721e-04 6.162e-04 -1.253 0.210224
## TO_SSR_FS 1.420e-03 8.712e-04 1.630 0.103048
## TO_SSR_LS -2.236e-03 8.180e-04 -2.734 0.006266 **
## TO_LS_SSR 2.533e-04 6.496e-04 0.390 0.696632
## TO_LS_M 7.052e-04 5.013e-04 1.407 0.159508
## TO_LS_SSL 3.637e-04 6.891e-04 0.528 0.597618
## TO_SSL_LS -3.274e-04 8.249e-04 -0.397 0.691467
## TO_SSL_FS 4.086e-05 9.228e-04 0.044 0.964680
## TM_FS_SSL 5.095e-04 6.625e-04 0.769 0.441896
## TM_FS_M -1.217e-03 6.516e-04 -1.867 0.061900 .
## TM_FS_SSR -7.560e-04 7.099e-04 -1.065 0.286942
## TM_SSR_FS 1.306e-03 9.854e-04 1.326 0.184911
## TM_SSR_LS -9.450e-04 1.156e-03 -0.817 0.413788
## TM_LS_SSR -1.378e-03 6.347e-04 -2.172 0.029905 *
## TM_LS_M -2.125e-04 8.027e-04 -0.265 0.791170
## TM_LS_SSL 1.666e-04 6.990e-04 0.238 0.811601
## TM_SSL_LS -1.847e-03 9.021e-04 -2.047 0.040667 *
## TM_SSL_FS 2.446e-03 7.413e-04 3.299 0.000972 ***
## TU_FS_SSL -7.550e-04 9.822e-04 -0.769 0.442102
## TU_FS_M 2.049e-03 1.153e-03 1.777 0.075521 .
## TU_FS_SSR -4.318e-04 1.107e-03 -0.390 0.696440
## TU_SSR_FS -4.822e-04 1.124e-03 -0.429 0.667897
## TU_SSR_LS -7.924e-04 1.195e-03 -0.663 0.507194
## TU_LS_SSR 2.063e-03 1.010e-03 2.042 0.041156 *
## TU_LS_M 3.187e-03 1.264e-03 2.521 0.011723 *
## TU_LS_SSL -1.052e-03 1.016e-03 -1.035 0.300674
## TU_SSL_LS 9.481e-04 1.117e-03 0.849 0.396103
## TU_SSL_FS -1.399e-03 1.112e-03 -1.258 0.208411
## DT_SSR -3.572e-03 3.648e-03 -0.979 0.327524
## DT_SSL 2.018e-03 4.089e-03 0.494 0.621555
## DT_FS 8.808e-03 3.890e-03 2.264 0.023568 *
## DT_LS -1.076e-02 4.267e-03 -2.522 0.011688 *
## VG 4.026e-01 1.336e-01 3.014 0.002582 **
## FUELLSTAND 5.246e-03 5.824e-03 0.901 0.367686
## STRANGBREITE -1.950e-05 1.092e-04 -0.179 0.858271
## WASSER_SSR 1.582e+00 1.452e+00 1.089 0.275961
## WASSER_SSL -3.724e+00 1.763e+00 -2.112 0.034668 *
## WASSER_FS 1.568e-02 2.325e-01 0.067 0.946215
## WASSER_LS -6.082e-01 3.444e-01 -1.766 0.077401 .
## STOPFENSTELLUNG -2.706e-03 9.876e-04 -2.740 0.006157 **
## PLATTENDICKE_SSL -1.105e-02 5.876e-03 -1.881 0.059994 .
## PLATTENDICKE_SSR -5.142e-03 6.111e-03 -0.841 0.400124
## ARGON_DRUCK_ST -3.654e-04 2.768e-04 -1.320 0.186863
## ARGON_DURCHFL_ST 7.070e-04 7.740e-03 0.091 0.927226
## ARGON_DURCHFL_DUSCH -8.056e-04 2.050e-04 -3.929 8.56e-05 ***
## TUNDISH_POSITION 3.162e-04 4.550e-04 0.695 0.487003
## VERTEILERFUELLSTAND 2.349e-03 3.525e-03 0.666 0.505260
## NETTO_PFANNENINHALT -8.501e-05 5.878e-05 -1.446 0.148108
## KONI_LINKS -9.322e-03 1.058e-02 -0.881 0.378509
## ANST__VS_HG_3__IR__S 3.867e-02 4.738e-02 0.816 0.414367
## DICKE__AL__IR__S -2.305e-01 1.660e-01 -1.389 0.164917
## DICKE__HA_1__IR__S -2.860e-02 1.179e-01 -0.243 0.808334
## DICKE__HA_2__IR__S 1.286e-01 1.423e-01 0.903 0.366483
## DICKE__VB__IR__S -2.147e-02 4.097e-02 -0.524 0.600294
## ENTZ__FS_ZW_F1__IR__S 1.300e+00 5.197e-01 2.501 0.012409 *
## ENTZ__FS_ZW_F2__IR__S 6.093e-01 6.017e-01 1.013 0.311270
## ENTZ__ZWR1_AL_SN2__IR__S 5.213e+00 1.947e+00 2.678 0.007423 **
## ENTZ__ZW_OF_AL__IR__S 1.176e+01 1.744e+01 0.674 0.500182
## KEIL25__FB__IR__S -6.419e-03 6.359e-03 -1.009 0.312826
## KEIL40__FB__IR__S 2.183e-02 6.616e-03 3.299 0.000973 ***
## PR_40__FB__IR__S -5.267e-03 6.040e-03 -0.872 0.383208
## RISS__HA_AS__IR__S -9.389e-01 1.080e+00 -0.869 0.384858
## RISS__HA_BS__IR__S -6.366e-02 5.195e-01 -0.123 0.902476
## TEMP__FB_1__IR__S 4.224e-04 3.939e-04 1.073 0.283508
## TEMP__FB_3__IR__S -2.760e-04 3.614e-04 -0.764 0.445018
## TEMP__HA_1__IR__S 2.599e-03 7.809e-04 3.329 0.000875 ***
## TEMP__HA_4__IR__S -4.546e-04 8.165e-04 -0.557 0.577678
## TEMP__HA_5__IR__S -1.721e-03 5.944e-04 -2.895 0.003796 **
## TEMP__HA__SR__MAX -1.989e-05 9.419e-05 -0.211 0.832728
## V__FS_G2__IR__S -2.614e-03 4.086e-03 -0.640 0.522302
## V__FS_G3__IR__S 1.530e-03 1.317e-03 1.162 0.245249
## V__FS_G4__IR__S -7.725e-04 7.813e-04 -0.989 0.322858
## V__FS_G5__IR__S 7.209e-04 4.857e-04 1.484 0.137705
## V__FS_G6__IR__S 6.037e-04 3.552e-04 1.699 0.089282 .
## V__FS_G7__IR__S -6.164e-04 1.133e-03 -0.544 0.586501
## WK__FS_G7__IR__S 1.318e-04 1.231e-04 1.071 0.284151
## WK__VS_SP_3__IR__S 1.660e-03 1.032e-02 0.161 0.872271
## WSPALT__FS_G7__IR__S -1.281e-01 2.922e-01 -0.438 0.661134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4625 on 13963 degrees of freedom
## (12836 observations deleted due to missingness)
## Multiple R-squared: 0.02408, Adjusted R-squared: 0.01835
## F-statistic: 4.202 on 82 and 13963 DF, p-value: < 2.2e-16
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| POSITION_X | 1 | 12.1052258 | 12.1052258 | 56.6013309 | 0.0000000 |
| VORG_HAUPTAGGREGAT | 1 | 0.3064707 | 0.3064707 | 1.4329886 | 0.2312976 |
| TO_FS_SSL | 1 | 0.3044748 | 0.3044748 | 1.4236563 | 0.2328228 |
| TO_FS_M | 1 | 0.9983175 | 0.9983175 | 4.6679096 | 0.0307483 |
| TO_FS_SSR | 1 | 0.2452430 | 0.2452430 | 1.1467016 | 0.2842592 |
| TO_SSR_FS | 1 | 0.4734872 | 0.4734872 | 2.2139202 | 0.1367933 |
| TO_SSR_LS | 1 | 2.2104284 | 2.2104284 | 10.3354694 | 0.0013079 |
| TO_LS_SSR | 1 | 0.1913628 | 0.1913628 | 0.8947698 | 0.3442044 |
| TO_LS_M | 1 | 1.2948613 | 1.2948613 | 6.0544822 | 0.0138831 |
| TO_LS_SSL | 1 | 0.4928747 | 0.4928747 | 2.3045718 | 0.1290165 |
| TO_SSL_LS | 1 | 0.4538694 | 0.4538694 | 2.1221920 | 0.1452011 |
| TO_SSL_FS | 1 | 2.7332745 | 2.7332745 | 12.7801808 | 0.0003515 |
| TM_FS_SSL | 1 | 0.0882784 | 0.0882784 | 0.4127701 | 0.5205774 |
| TM_FS_M | 1 | 1.3246962 | 1.3246962 | 6.1939832 | 0.0128301 |
| TM_FS_SSR | 1 | 0.9366880 | 0.9366880 | 4.3797436 | 0.0363865 |
| TM_SSR_FS | 1 | 1.1467355 | 1.1467355 | 5.3618791 | 0.0205960 |
| TM_SSR_LS | 1 | 0.5801224 | 0.5801224 | 2.7125226 | 0.0995859 |
| TM_LS_SSR | 1 | 1.7378946 | 1.7378946 | 8.1260067 | 0.0043699 |
| TM_LS_M | 1 | 0.7700484 | 0.7700484 | 3.6005745 | 0.0577802 |
| TM_LS_SSL | 1 | 0.3290349 | 0.3290349 | 1.5384935 | 0.2148630 |
| TM_SSL_LS | 1 | 0.7101211 | 0.7101211 | 3.3203677 | 0.0684479 |
| TM_SSL_FS | 1 | 4.0556880 | 4.0556880 | 18.9634910 | 0.0000134 |
| TU_FS_SSL | 1 | 0.3663775 | 0.3663775 | 1.7130991 | 0.1906053 |
| TU_FS_M | 1 | 2.5541811 | 2.5541811 | 11.9427800 | 0.0005502 |
| TU_FS_SSR | 1 | 0.1974841 | 0.1974841 | 0.9233916 | 0.3366027 |
| TU_SSR_FS | 1 | 1.3126413 | 1.3126413 | 6.1376172 | 0.0132453 |
| TU_SSR_LS | 1 | 0.2708226 | 0.2708226 | 1.2663060 | 0.2604802 |
| TU_LS_SSR | 1 | 1.1134347 | 1.1134347 | 5.2061720 | 0.0225218 |
| TU_LS_M | 1 | 0.2590926 | 0.2590926 | 1.2114590 | 0.2710623 |
| TU_LS_SSL | 1 | 0.3614753 | 0.3614753 | 1.6901775 | 0.1935990 |
| TU_SSL_LS | 1 | 0.0086773 | 0.0086773 | 0.0405731 | 0.8403671 |
| TU_SSL_FS | 1 | 1.1712469 | 1.1712469 | 5.4764889 | 0.0192879 |
| DT_SSR | 1 | 0.8342697 | 0.8342697 | 3.9008587 | 0.0482811 |
| DT_SSL | 1 | 0.3408300 | 0.3408300 | 1.5936448 | 0.2068273 |
| DT_FS | 1 | 0.2942908 | 0.2942908 | 1.3760380 | 0.2407972 |
| DT_LS | 1 | 1.5307590 | 1.5307590 | 7.1574872 | 0.0074739 |
| VG | 1 | 3.8163329 | 3.8163329 | 17.8443201 | 0.0000241 |
| FUELLSTAND | 1 | 0.2139371 | 0.2139371 | 1.0003221 | 0.3172499 |
| STRANGBREITE | 1 | 0.1871895 | 0.1871895 | 0.8752564 | 0.3495204 |
| WASSER_SSR | 1 | 0.3589889 | 0.3589889 | 1.6785519 | 0.1951384 |
| WASSER_SSL | 1 | 0.4405181 | 0.4405181 | 2.0597643 | 0.1512560 |
| WASSER_FS | 1 | 0.0093231 | 0.0093231 | 0.0435930 | 0.8346158 |
| WASSER_LS | 1 | 0.8040779 | 0.8040779 | 3.7596887 | 0.0525225 |
| STOPFENSTELLUNG | 1 | 1.3107257 | 1.3107257 | 6.1286604 | 0.0133125 |
| PLATTENDICKE_SSL | 1 | 0.8622883 | 0.8622883 | 4.0318674 | 0.0446677 |
| PLATTENDICKE_SSR | 1 | 0.3175894 | 0.3175894 | 1.4849773 | 0.2230180 |
| ARGON_DRUCK_ST | 1 | 0.2258808 | 0.2258808 | 1.0561681 | 0.3041086 |
| ARGON_DURCHFL_ST | 1 | 0.0140627 | 0.0140627 | 0.0657542 | 0.7976259 |
| ARGON_DURCHFL_DUSCH | 1 | 3.0933738 | 3.0933738 | 14.4639247 | 0.0001435 |
| TUNDISH_POSITION | 1 | 0.0427591 | 0.0427591 | 0.1999318 | 0.6547829 |
| VERTEILERFUELLSTAND | 1 | 0.1813750 | 0.1813750 | 0.8480689 | 0.3571151 |
| NETTO_PFANNENINHALT | 1 | 0.3953274 | 0.3953274 | 1.8484625 | 0.1739843 |
| KONI_LINKS | 1 | 0.1736209 | 0.1736209 | 0.8118125 | 0.3676005 |
| ANST__VS_HG_3__IR__S | 1 | 3.5701146 | 3.5701146 | 16.6930581 | 0.0000442 |
| DICKE__AL__IR__S | 1 | 0.3173051 | 0.3173051 | 1.4836477 | 0.2232253 |
| DICKE__HA_1__IR__S | 1 | 0.0055638 | 0.0055638 | 0.0260150 | 0.8718660 |
| DICKE__HA_2__IR__S | 1 | 0.1528517 | 0.1528517 | 0.7147003 | 0.3979023 |
| DICKE__VB__IR__S | 1 | 0.1091491 | 0.1091491 | 0.5103566 | 0.4749965 |
| ENTZ__FS_ZW_F1__IR__S | 1 | 1.5400281 | 1.5400281 | 7.2008271 | 0.0072956 |
| ENTZ__FS_ZW_F2__IR__S | 1 | 0.0453203 | 0.0453203 | 0.2119077 | 0.6452834 |
| ENTZ__ZWR1_AL_SN2__IR__S | 1 | 2.1054425 | 2.1054425 | 9.8445785 | 0.0017069 |
| ENTZ__ZW_OF_AL__IR__S | 1 | 0.0949082 | 0.0949082 | 0.4437697 | 0.5053196 |
| KEIL25__FB__IR__S | 1 | 0.0723374 | 0.0723374 | 0.3382336 | 0.5608600 |
| KEIL40__FB__IR__S | 1 | 2.2036277 | 2.2036277 | 10.3036705 | 0.0013307 |
| PR_40__FB__IR__S | 1 | 0.0819987 | 0.0819987 | 0.3834077 | 0.5357952 |
| RISS__HA_AS__IR__S | 1 | 0.1979669 | 0.1979669 | 0.9256488 | 0.3360128 |
| RISS__HA_BS__IR__S | 1 | 0.0003147 | 0.0003147 | 0.0014716 | 0.9693997 |
| TEMP__FB_1__IR__S | 1 | 0.4173838 | 0.4173838 | 1.9515933 | 0.1624375 |
| TEMP__FB_3__IR__S | 1 | 0.0038659 | 0.0038659 | 0.0180761 | 0.8930506 |
| TEMP__HA_1__IR__S | 1 | 2.3284451 | 2.3284451 | 10.8872891 | 0.0009707 |
| TEMP__HA_4__IR__S | 1 | 0.0025675 | 0.0025675 | 0.0120051 | 0.9127539 |
| TEMP__HA_5__IR__S | 1 | 1.7216025 | 1.7216025 | 8.0498286 | 0.0045574 |
| TEMP__HA__SR__MAX | 1 | 0.0251762 | 0.0251762 | 0.1177183 | 0.7315276 |
| V__FS_G2__IR__S | 1 | 0.0906945 | 0.0906945 | 0.4240670 | 0.5149253 |
| V__FS_G3__IR__S | 1 | 0.0295655 | 0.0295655 | 0.1382416 | 0.7100411 |
| V__FS_G4__IR__S | 1 | 0.3805337 | 0.3805337 | 1.7792904 | 0.1822582 |
| V__FS_G5__IR__S | 1 | 0.7436703 | 0.7436703 | 3.4772362 | 0.0622395 |
| V__FS_G6__IR__S | 1 | 0.6349043 | 0.6349043 | 2.9686704 | 0.0849137 |
| V__FS_G7__IR__S | 1 | 0.0083502 | 0.0083502 | 0.0390438 | 0.8433649 |
| WK__FS_G7__IR__S | 1 | 0.2152435 | 0.2152435 | 1.0064305 | 0.3157769 |
| WK__VS_SP_3__IR__S | 1 | 0.0058208 | 0.0058208 | 0.0272168 | 0.8689660 |
| WSPALT__FS_G7__IR__S | 1 | 0.0410964 | 0.0410964 | 0.1921576 | 0.6611337 |
| Residuals | 13963 | 2986.2419272 | 0.2138682 | NA | NA |
The first model used in this analysis is a linear model. The prediction equation takes into account all variables, excluding those used for identification and those describing the error counts themselves. Although the data is not expected to behave linearly, a linear model is applied first, both to establish a baseline of possibly significant variables and to review the usefulness of any specific predictor equations. As we are reviewing the data for all possible significant variables, we start with the full set of variables in our predictive equation and reduce from there if necessary.
From the summary and ANOVA information presented above, one can determine the variables that are statistically significant in predicting the occurrence of log(Class 4) errors when the data is modeled using a linear model. Significant variables according to this model are:
Although all of these variables are marked as significant by the model, the model fits the data very poorly, with an adjusted R-squared value of 0.018. As the linear model explains so little of the variability in the data, it is necessary to use other models going forward for selecting significant variables.
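As an illustration of how significant terms and the adjusted R-squared can be read programmatically from a fitted linear model, the following is a minimal sketch on simulated data. The data frame `toy`, its columns, and the 0.05 threshold are assumptions made for this example only and are not part of the steel data.

```r
# Hypothetical toy data: only x1 truly influences y.
set.seed(1)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 0.5 * toy$x1 + rnorm(n)

fit <- lm(y ~ ., data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# Terms with p-value below 0.05, ignoring the intercept
sig <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
sig <- setdiff(sig, "(Intercept)")

# Share of variance explained, penalised for the number of terms
adj_r2 <- summary(fit)$adj.r.squared

sig
adj_r2
```

A low `adj_r2` alongside many starred terms, as seen in the summary above, suggests that the individual p-values reflect the large sample size more than any real predictive strength.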
Tree models and random forest models will be used for all error classes to test for variable significance. For this work, conditional trees and conditional random forests have been used, by means of the ctree and cforest functions in the partykit and party packages, respectively. The party package can be used for both ctree and cforest, but its updated version, partykit, has improved upon the implementation of the old ctree function. The cforest function is not yet fully developed in the partykit package, and that version is not used here. Conditional trees were chosen for this analysis to avoid the bias seen in rpart trees. Rpart trees tend to select node variables with the greatest potential for many splits, while conditional trees implement a selection algorithm specifically designed to avoid this bias.
set.seed(80542)
Here a random seed is set to allow for reproducible results.
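To show what the seed does and does not guarantee, the following small sketch uses only base R. The object names are hypothetical. Re-setting the seed reproduces a random draw exactly, whereas consecutive draws after a single call to set.seed differ from one another.

```r
set.seed(80542)
first_draw  <- sample(1:100, 5)   # first draw after seeding
second_draw <- sample(1:100, 5)   # continues the same random stream

set.seed(80542)
repeat_draw <- sample(1:100, 5)   # same seed, so same draw as first_draw

identical(first_draw, repeat_draw)   # reproducible after re-seeding
identical(first_draw, second_draw)   # successive draws are not the same
```

Because the seed is set only once in this document, each later data split or forest continues the same random stream, so successive models are expected to differ from one another even though the document as a whole is reproducible.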
In order to model the log error counts with a conditional tree, the same prediction equation used in the linear model is implemented. As such, all variables are considered as predictors, excluding those used for identification and those variables which describe the error counts themselves. This predictive equation is saved as “lnC4.pred”, and is displayed below. A similar equation will be used in the analysis of each error class.
index <- createDataPartition(df1$lnClass.4, p=0.75, list=FALSE)
trainSet <- df1[ index,]
testSet <- df1[-index,]
predictors<- colnames(trainSet)
predictors <- predictors[!predictors %in% c("CoilID", "MAT_IDENT", "lTileID", "lnClass.4", "Class.4", "Class.14", "Class.15", "lnClass.14", "lnClass.15")]
lnC4.pred <- formula(paste("lnClass.4 ~ ", paste(predictors, collapse= " + ")))
lnC4.pred
## lnClass.4 ~ POSITION_X + VORG_HAUPTAGGREGAT + TO_FS_SSL + TO_FS_M +
## TO_FS_SSR + TO_SSR_FS + TO_SSR_LS + TO_LS_SSR + TO_LS_M +
## TO_LS_SSL + TO_SSL_LS + TO_SSL_FS + TM_FS_SSL + TM_FS_M +
## TM_FS_SSR + TM_SSR_FS + TM_SSR_LS + TM_LS_SSR + TM_LS_M +
## TM_LS_SSL + TM_SSL_LS + TM_SSL_FS + TU_FS_SSL + TU_FS_M +
## TU_FS_SSR + TU_SSR_FS + TU_SSR_LS + TU_LS_SSR + TU_LS_M +
## TU_LS_SSL + TU_SSL_LS + TU_SSL_FS + DT_SSR + DT_SSL + DT_FS +
## DT_LS + VG + FUELLSTAND + STRANGBREITE + WASSER_SSR + WASSER_SSL +
## WASSER_FS + WASSER_LS + STOPFENSTELLUNG + PLATTENDICKE_SSL +
## PLATTENDICKE_SSR + ARGON_DRUCK_ST + ARGON_DURCHFL_ST + ARGON_DURCHFL_DUSCH +
## TUNDISH_POSITION + VERTEILERFUELLSTAND + NETTO_PFANNENINHALT +
## KONI_LINKS + ANST__VS_HG_3__IR__S + DICKE__AL__IR__S + DICKE__HA_1__IR__S +
## DICKE__HA_2__IR__S + DICKE__VB__IR__S + ENTZ__FS_ZW_F1__IR__S +
## ENTZ__FS_ZW_F2__IR__S + ENTZ__ZWR1_AL_SN2__IR__S + ENTZ__ZW_OF_AL__IR__S +
## KEIL25__FB__IR__S + KEIL40__FB__IR__S + PR_40__FB__IR__S +
## RISS__HA_AS__IR__S + RISS__HA_BS__IR__S + TEMP__FB_1__IR__S +
## TEMP__FB_3__IR__S + TEMP__HA_1__IR__S + TEMP__HA_4__IR__S +
## TEMP__HA_5__IR__S + TEMP__HA__SR__MAX + V__FS_G2__IR__S +
## V__FS_G3__IR__S + V__FS_G4__IR__S + V__FS_G5__IR__S + V__FS_G6__IR__S +
## V__FS_G7__IR__S + WK__FS_G7__IR__S + WK__VS_SP_3__IR__S +
## WSPALT__FS_G7__IR__S
output.tree <- partykit::ctree(lnC4.pred, data = trainSet)
png("anna_tks_tree10.png", res=80, height=800, width=1600)
plot(output.tree)
dev.off()
## png
## 2
print(output.tree)
##
## Model formula:
## lnClass.4 ~ POSITION_X + VORG_HAUPTAGGREGAT + TO_FS_SSL + TO_FS_M +
## TO_FS_SSR + TO_SSR_FS + TO_SSR_LS + TO_LS_SSR + TO_LS_M +
## TO_LS_SSL + TO_SSL_LS + TO_SSL_FS + TM_FS_SSL + TM_FS_M +
## TM_FS_SSR + TM_SSR_FS + TM_SSR_LS + TM_LS_SSR + TM_LS_M +
## TM_LS_SSL + TM_SSL_LS + TM_SSL_FS + TU_FS_SSL + TU_FS_M +
## TU_FS_SSR + TU_SSR_FS + TU_SSR_LS + TU_LS_SSR + TU_LS_M +
## TU_LS_SSL + TU_SSL_LS + TU_SSL_FS + DT_SSR + DT_SSL + DT_FS +
## DT_LS + VG + FUELLSTAND + STRANGBREITE + WASSER_SSR + WASSER_SSL +
## WASSER_FS + WASSER_LS + STOPFENSTELLUNG + PLATTENDICKE_SSL +
## PLATTENDICKE_SSR + ARGON_DRUCK_ST + ARGON_DURCHFL_ST + ARGON_DURCHFL_DUSCH +
## TUNDISH_POSITION + VERTEILERFUELLSTAND + NETTO_PFANNENINHALT +
## KONI_LINKS + ANST__VS_HG_3__IR__S + DICKE__AL__IR__S + DICKE__HA_1__IR__S +
## DICKE__HA_2__IR__S + DICKE__VB__IR__S + ENTZ__FS_ZW_F1__IR__S +
## ENTZ__FS_ZW_F2__IR__S + ENTZ__ZWR1_AL_SN2__IR__S + ENTZ__ZW_OF_AL__IR__S +
## KEIL25__FB__IR__S + KEIL40__FB__IR__S + PR_40__FB__IR__S +
## RISS__HA_AS__IR__S + RISS__HA_BS__IR__S + TEMP__FB_1__IR__S +
## TEMP__FB_3__IR__S + TEMP__HA_1__IR__S + TEMP__HA_4__IR__S +
## TEMP__HA_5__IR__S + TEMP__HA__SR__MAX + V__FS_G2__IR__S +
## V__FS_G3__IR__S + V__FS_G4__IR__S + V__FS_G5__IR__S + V__FS_G6__IR__S +
## V__FS_G7__IR__S + WK__FS_G7__IR__S + WK__VS_SP_3__IR__S +
## WSPALT__FS_G7__IR__S
##
## Fitted party:
## [1] root
## | [2] VG <= 0.9956
## | | [3] ANST__VS_HG_3__IR__S <= 0.94633
## | | | [4] V__FS_G5__IR__S <= 17.89324: 0.203 (n = 6944, err = 1397.3)
## | | | [5] V__FS_G5__IR__S > 17.89324
## | | | | [6] TM_LS_SSR <= 134.1
## | | | | | [7] POSITION_X <= 75.99812: 0.418 (n = 94, err = 36.3)
## | | | | | [8] POSITION_X > 75.99812
## | | | | | | [9] KEIL40__FB__IR__S <= -0.75525: 0.176 (n = 865, err = 140.0)
## | | | | | | [10] KEIL40__FB__IR__S > -0.75525
## | | | | | | | [11] POSITION_X <= 509.71317: 0.279 (n = 2044, err = 520.3)
## | | | | | | | [12] POSITION_X > 509.71317: 0.207 (n = 1211, err = 225.4)
## | | | | [13] TM_LS_SSR > 134.1: 0.184 (n = 2167, err = 365.2)
## | | [14] ANST__VS_HG_3__IR__S > 0.94633: 0.303 (n = 426, err = 120.4)
## | [15] VG > 0.9956
## | | [16] TO_SSL_FS <= 232.65
## | | | [17] TEMP__HA_1__IR__S <= 27.29227
## | | | | [18] ENTZ__FS_ZW_F1__IR__S <= 0.04688: 0.240 (n = 2294, err = 495.0)
## | | | | [19] ENTZ__FS_ZW_F1__IR__S > 0.04688: 0.463 (n = 82, err = 28.1)
## | | | [20] TEMP__HA_1__IR__S > 27.29227: 0.273 (n = 3325, err = 776.5)
## | | [21] TO_SSL_FS > 232.65: 0.454 (n = 710, err = 239.3)
##
## Number of inner nodes: 10
## Number of terminal nodes: 11
plot(output.tree,
main = "Log Class 4 Error Counts Tree",
gp = gpar(fontsize = 10),
inner_panel=node_inner,
ip_args=list(abbreviate = FALSE, id = FALSE)
)
Above one can see both the R output describing the conditional tree and the tree plot. According to the conditional tree model, the significant variables for predicting a Class 4 error are:
These are listed without repetition, although within the tree VG is both the root node and an inner node. With regard to the plotted tree, the terminal nodes show a box plot of the observations in each node. For the nodes which do not present an obvious “box”, the implication is that there are so many 0 observations in the node that the interquartile range (IQR) has compressed around 0. As such, those nodes with an observable IQR in the box plot can be known to contain more observations away from 0, i.e. errors. One can therefore see that, with regard to relative distributions, the nodes which contain the most error values are nodes 13, 15, 16, and 18. Variables which dictated the creation of these terminal nodes are:
In the above tree, the outcome variable, log(Class 4) errors, has multiple observable values. These different values most likely refer to slight variations in observed Class 4 errors, or to changes in severity. As the goal of this analysis is to find variables linked to the presence of any error, regardless of severity, size, or other specifying qualities, it may be better to create a binary interpretation of the Class 4 error variable. This will allow the model to predict the presence of an error directly, rather than forcing it to account for multiple levels of error. Below, a new variable is created from the original Class 4 error data, providing a binary interpretation of the error counts.
df1$C4 <- with(df1, Class.4>0)
df1$C4<-factor(df1$C4, levels=c(FALSE,TRUE), labels=c("no.error", "error"))
prop.table(table(df1$C4))
##
## no.error error
## 0.7651588 0.2348412
From the above proportion table we can see that, under the binary interpretation, our data splits into 76.52% “no Class 4 error” and 23.48% “Class 4 error” observations.
index <- createDataPartition(df1$C4, p=0.75, list=FALSE)
trainSet <- df1[ index,]
testSet <- df1[-index,]
C4.pred <- formula(paste("C4 ~ ", paste(predictors, collapse= " + ")))
output.tree <- partykit::ctree(C4.pred, data = trainSet)
png("anna_tks_tree20.png", res=80, height=800, width=1600)
plot(output.tree)
dev.off()
## png
## 2
print(output.tree)
##
## Model formula:
## C4 ~ POSITION_X + VORG_HAUPTAGGREGAT + TO_FS_SSL + TO_FS_M +
## TO_FS_SSR + TO_SSR_FS + TO_SSR_LS + TO_LS_SSR + TO_LS_M +
## TO_LS_SSL + TO_SSL_LS + TO_SSL_FS + TM_FS_SSL + TM_FS_M +
## TM_FS_SSR + TM_SSR_FS + TM_SSR_LS + TM_LS_SSR + TM_LS_M +
## TM_LS_SSL + TM_SSL_LS + TM_SSL_FS + TU_FS_SSL + TU_FS_M +
## TU_FS_SSR + TU_SSR_FS + TU_SSR_LS + TU_LS_SSR + TU_LS_M +
## TU_LS_SSL + TU_SSL_LS + TU_SSL_FS + DT_SSR + DT_SSL + DT_FS +
## DT_LS + VG + FUELLSTAND + STRANGBREITE + WASSER_SSR + WASSER_SSL +
## WASSER_FS + WASSER_LS + STOPFENSTELLUNG + PLATTENDICKE_SSL +
## PLATTENDICKE_SSR + ARGON_DRUCK_ST + ARGON_DURCHFL_ST + ARGON_DURCHFL_DUSCH +
## TUNDISH_POSITION + VERTEILERFUELLSTAND + NETTO_PFANNENINHALT +
## KONI_LINKS + ANST__VS_HG_3__IR__S + DICKE__AL__IR__S + DICKE__HA_1__IR__S +
## DICKE__HA_2__IR__S + DICKE__VB__IR__S + ENTZ__FS_ZW_F1__IR__S +
## ENTZ__FS_ZW_F2__IR__S + ENTZ__ZWR1_AL_SN2__IR__S + ENTZ__ZW_OF_AL__IR__S +
## KEIL25__FB__IR__S + KEIL40__FB__IR__S + PR_40__FB__IR__S +
## RISS__HA_AS__IR__S + RISS__HA_BS__IR__S + TEMP__FB_1__IR__S +
## TEMP__FB_3__IR__S + TEMP__HA_1__IR__S + TEMP__HA_4__IR__S +
## TEMP__HA_5__IR__S + TEMP__HA__SR__MAX + V__FS_G2__IR__S +
## V__FS_G3__IR__S + V__FS_G4__IR__S + V__FS_G5__IR__S + V__FS_G6__IR__S +
## V__FS_G7__IR__S + WK__FS_G7__IR__S + WK__VS_SP_3__IR__S +
## WSPALT__FS_G7__IR__S
##
## Fitted party:
## [1] root
## | [2] VG <= 0.982
## | | [3] ENTZ__FS_ZW_F1__IR__S <= 0.02941
## | | | [4] WASSER_SSR <= 0.2632: no.error (n = 9064, err = 19.3%)
## | | | [5] WASSER_SSR > 0.2632: no.error (n = 3673, err = 23.7%)
## | | [6] ENTZ__FS_ZW_F1__IR__S > 0.02941
## | | | [7] POSITION_X <= 79.28875: no.error (n = 386, err = 33.4%)
## | | | [8] POSITION_X > 79.28875: no.error (n = 549, err = 22.2%)
## | [9] VG > 0.982
## | | [10] TU_LS_M <= 115.35: no.error (n = 4275, err = 26.3%)
## | | [11] TU_LS_M > 115.35
## | | | [12] VG <= 1.138
## | | | | [13] TM_LS_SSL <= 136.15
## | | | | | [14] DICKE__AL__IR__S <= 0.17795: no.error (n = 419, err = 37.5%)
## | | | | | [15] DICKE__AL__IR__S > 0.17795
## | | | | | | [16] TO_SSL_FS <= 210.8: no.error (n = 61, err = 11.5%)
## | | | | | | [17] TO_SSL_FS > 210.8: no.error (n = 171, err = 33.3%)
## | | | | [18] TM_LS_SSL > 136.15: no.error (n = 673, err = 24.2%)
## | | | [19] VG > 1.138: no.error (n = 891, err = 39.3%)
##
## Number of inner nodes: 9
## Number of terminal nodes: 10
plot(output.tree,
main = "Binary Class 4 Error Counts Conditional Tree",
gp = gpar(fontsize = 10),
inner_panel=node_inner,
ip_args=list(abbreviate = FALSE, id = FALSE)
)
Now with a binary interpretation of Class 4 errors, the printed tree output and the plotted tree can be seen above. Where before the terminal panels showed a box plot of the data in each node, here instead is a bar chart, showing the proportion of errors vs non-errors in each terminal node. It is clear from observing the tree above that the nodes with the greatest proportion of errors are nodes 7 (34.2% errors), 18 (30.8% errors), 26 (33.5% errors), and 27 (39.1% errors). Reviewing the internal nodes, one can see the significant variables are:
Although not among the variables marked as most significant in the linear model, VG has now appeared as the most significant variable for both the binary tree and the tree describing the log(Class 4) error counts. However, having run these code chunks multiple times, it has been noted that the structure of the above binary tree varies widely between runs. As such, one progresses to the use of a random forest.
In the initial run of this analysis, it was noted that there was a large amount of variation between runs, and that, without a stable seed, the results from independent runs were practically non-comparable. In part, the large amount of variability seen from model to model is due to the default construction of the cforest function.
The standard random forest model considers a default number of variables at each split in each tree, randomly selecting that many variables from the total available. The number of variables considered at each split can be adjusted through the mtry parameter, so that it can be increased for data sets with large numbers of variables. A tuning function for finding the best value of mtry, tuneRF, is also available in the randomForest package. In the cforest function this parameter is set internally to 5. It may be changed using the cforest_control function, but the party package does not come with a tuning function for this parameter. As such, the parameter would need to be adjusted by trial and error, which is why, for this analysis, only the number of trees was increased and mtry was left at its default level.
Given that the df data frame used to produce the conditional random forest contains 94 variables, many more trees are needed to consider all possible split options when mtry is set to only 5. As such, the main alteration made to this analysis from that of Prof. Wilhelm is that the number of trees in this model, and in all following cforest models, has been increased to 1000. Tree counts of both 500 and 800 were also tested, but the variability in the output models was not reduced enough to allow a review for significant variables. When the model was grown with a larger number of trees in the forest, a notable pattern of significant variables began to arise.
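The arithmetic behind this argument can be sketched as follows. The value p = 82 is an assumption, taken from the 82 model degrees of freedom reported in the F-statistic of the linear model above. The randomForest defaults shown are that package's documented rules: the square root of p for classification and p/3 for regression, both rounded down. All object names are hypothetical.

```r
p <- 82            # assumed number of predictors in the formula
mtry_cforest <- 5  # default used by cforest_control in party

# Defaults used by randomForest, for comparison
rf_default_class <- floor(sqrt(p))        # classification
rf_default_reg   <- max(floor(p / 3), 1)  # regression

# Chance that one given predictor is among the candidates at a single split
p_considered <- mtry_cforest / p

# Expected number of trees in which a given predictor is a candidate
# for the root split, as a function of forest size
expected_root_trees <- function(ntree) ntree * p_considered

rf_default_class
rf_default_reg
round(p_considered, 3)
sapply(c(500, 800, 1000), expected_root_trees)
```

Under these assumptions each predictor is a candidate at only about 6% of splits, so doubling the forest from 500 to 1000 trees doubles the number of trees in which any one predictor can compete for the root split.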
dev.off()
## null device
## 1
trainSet2 <- trainSet[sample(1:nrow(trainSet), 10000,
replace=FALSE),]
output.forest <- party::cforest(C4.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c4 <- party::varimp(output.forest)
png("anna_tks_tree20varimp.png", res=80, height=800, width=1600)
barchart(tail(sort(var.imp.c4), 40), xlab="Variable Importance", main="Variable Importance for Class 4 Errors")
var.imp.c4.1 <- var.imp.c4
Rather than examining the structure of any trees in the forest, displayed above is the variable importance for the Class 4 random forest. This variable importance chart will be used to review for the most significant variables in the forest. All variable importances computed in this analysis follow the permutation principle of the “mean decrease in accuracy”: the more the mean accuracy of the random forest decreases when a variable is removed or permuted, the more important that variable is deemed to be. One sees immediately that the variable deemed most significant in classifying Class 4 errors is, again, VG. However, as the trees from before suffered from large amounts of variation, the random forest model above is tested for the same weakness below.
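The permutation principle can be illustrated with a minimal base R sketch. A logistic regression on simulated data stands in for the forest here, and accuracy is measured on the training data rather than on out-of-bag observations, so this is a simplification of what party::varimp computes. All names (`toy`, `strong`, `weak`, `noise`, `accuracy`) are hypothetical.

```r
set.seed(42)
n <- 1000
toy <- data.frame(strong = rnorm(n), weak = rnorm(n), noise = rnorm(n))
# Binary outcome driven mainly by 'strong', slightly by 'weak'
toy$error <- rbinom(n, 1, plogis(2 * toy$strong + 0.3 * toy$weak))

fit <- glm(error ~ strong + weak + noise, data = toy, family = binomial)

# Share of observations classified correctly at a 0.5 cut-off
accuracy <- function(model, data) {
  pred <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(pred == data$error)
}

base_acc <- accuracy(fit, toy)

# Importance = mean drop in accuracy after shuffling one predictor
perm_importance <- sapply(c("strong", "weak", "noise"), function(v) {
  drops <- replicate(20, {
    shuffled <- toy
    shuffled[[v]] <- sample(shuffled[[v]])  # break the link to the outcome
    base_acc - accuracy(fit, shuffled)
  })
  mean(drops)
})

sort(perm_importance)
```

Shuffling a predictor that the model relies on lowers accuracy noticeably, while shuffling an irrelevant one leaves it close to unchanged, which is why importances near zero (or slightly negative) are read as noise.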
output.forest <- party::cforest(C4.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c4 <- party::varimp(output.forest)
png("anna_tks_tree20varimp.2.png", res=80, height=800, width=1600)
barchart(tail(sort(var.imp.c4), 40), xlab="Variable Importance", main="Variable Importance for Class 4 Errors")
var.imp.c4.2 <- var.imp.c4
output.forest <- party::cforest(C4.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c4 <- party::varimp(output.forest)
png("anna_tks_tree20varimp.3.png", res=80, height=800, width=1600)
barchart(tail(sort(var.imp.c4), 40), xlab="Variable Importance", main="Variable Importance for Class 4 Errors")
var.imp.c4.3 <- var.imp.c4
The above code chunk shows the creation of two additional random forest models, both grown from the same training split as the first, and their corresponding variable importance calculations. Across the three iterations of the model, the variables which appeared within the top 20 most important variables for all three models are:
## [1] "TM_FS_SSL" "DT_FS" "TU_FS_M" "DT_SSL"
## [5] "TM_LS_SSR" "TM_LS_SSL" "V__FS_G4__IR__S" "TO_SSL_LS"
## [9] "TO_LS_SSL" "TM_SSR_FS" "KONI_LINKS" "STRANGBREITE"
## [13] "TO_SSL_FS" "WASSER_SSR" "WASSER_SSL" "VG"
Note that these variables are listed in order of increasing average importance across all three models. As such, VG is the most important on average, followed by WASSER_SSL, WASSER_SSR, and so on. In all three models, the first of which was used to generate the bar chart above, the most important variable for accurate classification was VG. Variable importance bar charts for all three models, and all future models, were saved when this document was knitted.
When considering the top 40 variables of each model, the same number shown in the bar chart, the following variables were shared by all three models.
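The code that produced the list of shared variables is not shown in this document. One way such a list could be built is sketched below, assuming three named importance vectors like those returned by varimp. The simulated vectors and the helper `top_k` are hypothetical.

```r
set.seed(7)
vars <- paste0("V", 1:30)

# Three hypothetical importance vectors: a common signal plus run-to-run noise
base_importance <- setNames(seq(0.001, 0.030, length.out = 30), vars)
runs <- lapply(1:3, function(i) base_importance + rnorm(30, sd = 0.002))

# Names of the k most important variables in one run
top_k <- function(imp, k) names(tail(sort(imp), k))

# Variables that appear in the top 20 of every run
shared <- Reduce(intersect, lapply(runs, top_k, k = 20))

# Average importance of the shared variables across runs,
# listed in increasing order as in the output above
avg_imp <- rowMeans(sapply(runs, function(imp) imp[shared]))
shared_ordered <- names(sort(avg_imp))

shared_ordered
```

Taking the intersection first and averaging afterwards means that a variable ranked highly in only one or two runs is dropped entirely, which is why the shared list can be shorter than 20.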
## [1] "TU_FS_SSR" "TM_LS_M" "TO_FS_SSL" "POSITION_X"
## [5] "TM_FS_SSR" "PLATTENDICKE_SSL" "PLATTENDICKE_SSR" "DT_LS"
## [9] "TU_SSL_FS" "TM_SSL_FS" "TM_FS_SSL" "TO_SSR_FS"
## [13] "DT_FS" "TU_FS_M" "DT_SSL" "TM_LS_SSR"
## [17] "TM_LS_SSL" "TO_FS_SSR" "TM_SSR_LS" "V__FS_G4__IR__S"
## [21] "TO_SSL_LS" "TO_LS_SSL" "TM_SSR_FS" "KONI_LINKS"
## [25] "STRANGBREITE" "TO_SSL_FS" "WASSER_SSR" "WASSER_SSL"
## [29] "VG"
Again, these variables are listed in order of increasing average importance for all three models.
df1 <- df1 %>%
dplyr::select(-C4)
In all remaining models, the use of log error counts is eschewed in favor of the binary transformation.
df1$C14 <- with(df1, Class.14>0)
df1$C14<-factor(df1$C14, levels=c(FALSE,TRUE), labels=c("no.error", "error"))
prop.table(table(df1$C14))
##
## no.error error
## 0.991890484 0.008109516
From the above proportion table, one can see that 99.19% of our observations are free of Class 14 errors, and only 0.81% contain one.
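To make this imbalance concrete, the following base R sketch simulates an outcome with roughly the same error rate and compares a simple random subsample of 10,000 rows with a subsample drawn separately within each class. The simulated outcome, the row count of 20,162 (taken from the size of the training split reported in the tree output further down), and all object names are assumptions for illustration only.

```r
set.seed(123)
n_train  <- 20162   # assumed size of the training split
err_rate <- 0.0081  # approximate Class 14 error rate

# Hypothetical outcome with about 0.81% errors
outcome <- factor(ifelse(runif(n_train) < err_rate, "error", "no.error"),
                  levels = c("no.error", "error"))

# Simple random subsample: the number of errors kept varies from draw to draw
simple_idx <- sample(seq_len(n_train), 10000, replace = FALSE)
table(outcome[simple_idx])

# Stratified subsample: keep the same fraction of rows within each class
frac <- 10000 / n_train
strat_idx <- unlist(lapply(split(seq_len(n_train), outcome), function(idx) {
  idx[sample.int(length(idx), round(length(idx) * frac))]
}))
table(outcome[strat_idx])
```

With so few error rows available, a 10,000-row subsample holds only on the order of 80 errors, so small changes in which rows are drawn can noticeably change what a tree or forest is able to learn.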
index <- createDataPartition(df1$C14, p=0.75, list=FALSE)
trainSet <- df1[ index,]
testSet <- df1[-index,]
predictors<- colnames(trainSet)
predictors <- predictors[!predictors %in% c("CoilID", "MAT_IDENT", "lTileID", "lnClass.4", "Class.4", "Class.14", "Class.15", "lnClass.14", "lnClass.15", "C4", "C14")]
C14.pred <- formula(paste("C14 ~ ", paste(predictors, collapse= " + ")))
output.tree.c14 <- partykit::ctree(C14.pred, data = trainSet)
png("anna_tks_tree_c14_2.png", res=80, height=800, width=1600)
plot(output.tree.c14)
dev.off()
## png
## 2
print(output.tree.c14)
##
## Model formula:
## C14 ~ POSITION_X + VORG_HAUPTAGGREGAT + TO_FS_SSL + TO_FS_M +
## TO_FS_SSR + TO_SSR_FS + TO_SSR_LS + TO_LS_SSR + TO_LS_M +
## TO_LS_SSL + TO_SSL_LS + TO_SSL_FS + TM_FS_SSL + TM_FS_M +
## TM_FS_SSR + TM_SSR_FS + TM_SSR_LS + TM_LS_SSR + TM_LS_M +
## TM_LS_SSL + TM_SSL_LS + TM_SSL_FS + TU_FS_SSL + TU_FS_M +
## TU_FS_SSR + TU_SSR_FS + TU_SSR_LS + TU_LS_SSR + TU_LS_M +
## TU_LS_SSL + TU_SSL_LS + TU_SSL_FS + DT_SSR + DT_SSL + DT_FS +
## DT_LS + VG + FUELLSTAND + STRANGBREITE + WASSER_SSR + WASSER_SSL +
## WASSER_FS + WASSER_LS + STOPFENSTELLUNG + PLATTENDICKE_SSL +
## PLATTENDICKE_SSR + ARGON_DRUCK_ST + ARGON_DURCHFL_ST + ARGON_DURCHFL_DUSCH +
## TUNDISH_POSITION + VERTEILERFUELLSTAND + NETTO_PFANNENINHALT +
## KONI_LINKS + ANST__VS_HG_3__IR__S + DICKE__AL__IR__S + DICKE__HA_1__IR__S +
## DICKE__HA_2__IR__S + DICKE__VB__IR__S + ENTZ__FS_ZW_F1__IR__S +
## ENTZ__FS_ZW_F2__IR__S + ENTZ__ZWR1_AL_SN2__IR__S + ENTZ__ZW_OF_AL__IR__S +
## KEIL25__FB__IR__S + KEIL40__FB__IR__S + PR_40__FB__IR__S +
## RISS__HA_AS__IR__S + RISS__HA_BS__IR__S + TEMP__FB_1__IR__S +
## TEMP__FB_3__IR__S + TEMP__HA_1__IR__S + TEMP__HA_4__IR__S +
## TEMP__HA_5__IR__S + TEMP__HA__SR__MAX + V__FS_G2__IR__S +
## V__FS_G3__IR__S + V__FS_G4__IR__S + V__FS_G5__IR__S + V__FS_G6__IR__S +
## V__FS_G7__IR__S + WK__FS_G7__IR__S + WK__VS_SP_3__IR__S +
## WSPALT__FS_G7__IR__S
##
## Fitted party:
## [1] root: no.error (n = 20162, err = 0.8%)
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
plot(output.tree.c14,
main = "Binary Class 14 Error Conditional Tree",
gp = gpar(fontsize = 10),
inner_panel=node_inner,
ip_args=list(abbreviate = FALSE, id = FALSE)
)
The tree above has selected 5 different variables for its nodes. These are:
Although these variables are selected here as significant, as with previous tree models in this analysis, the structure is highly volatile under different seeds. This may in part be due to the very low rate of error observations in the data, paired with the relatively large number of variables. To cope with the notable variation between tree structures, the conditional random forest below is grown, again with the number of trees set to 1000.
trainSet2 <- trainSet[sample(1:nrow(trainSet), 10000,
replace=FALSE),]
output.rf.c14 <- party::cforest(C14.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c14 <- party::varimp(output.rf.c14)
png("anna_tks_rfc14varimp.1.png", res=80, height=800, width=1600)
barchart(tail(sort(var.imp.c14), 40), xlab="Variable Importance", main="Variable Importance Fehlertyp 14")
var.imp.c14.1 <- var.imp.c14
The above bar chart was constructed using the variable importance measures for the first of three conditional random forests, generated on the training split for the Class 14 errors. Just as with the Class 4 errors, three separate random forest models were grown for the final analysis, in order to compare their results for variability.
output.rf.c14 <- party::cforest(C14.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c14 <- party::varimp(output.rf.c14)
png("anna_tks_rfc14varimp.2.png", res=80, height=800, width=1600)
barchart(tail(sort(var.imp.c14), 40), xlab="Variable Importance", main="Variable Importance Fehlertyp 14")
var.imp.c14.2 <- var.imp.c14
output.rf.c14 <- party::cforest(C14.pred, data = trainSet2, control = cforest_control(ntree =1000))
var.imp.c14 <- party::varimp(output.rf.c14)
png("anna_tks_rfc14varimp.3.png", res=80, height=800, width=1600)
barchart(tail(sort(var.imp.c14), 40), xlab="Variable Importance", main="Variable Importance Fehlertyp 14")
var.imp.c14.3 <- var.imp.c14
The above chunk shows the construction of the additional two random forest models, trained on the same training split as the first model, whose variable importance is presented above.
When comparing across all three models, the variables which appear among the 20 most important in every model, with regard to mean decrease in accuracy, are:
## [1] "TM_LS_SSL" "STRANGBREITE" "TU_FS_M" "TO_LS_SSR" "TM_LS_M"
## [6] "WASSER_SSL" "TO_LS_SSL" "TU_LS_M"
Again, these variables are listed in order of increasing average importance across all three models.
When considering the top 40 most important variables for each model, as the barchart above shows for model 1, one finds the following variables to be shared by all three models.
## [1] "PLATTENDICKE_SSL" "TO_SSL_LS" "TM_FS_M"
## [4] "TM_SSL_LS" "TU_LS_SSR" "TM_SSR_LS"
## [7] "TU_SSR_FS" "DT_SSR" "TM_LS_SSR"
## [10] "TM_SSR_FS" "TO_SSR_FS" "TO_FS_SSR"
## [13] "VERTEILERFUELLSTAND" "WASSER_SSR" "ARGON_DRUCK_ST"
## [16] "VG" "TM_LS_SSL" "TU_SSL_LS"
## [19] "STRANGBREITE" "TU_FS_M" "TM_SSL_FS"
## [22] "TU_SSR_LS" "KONI_LINKS" "TO_LS_M"
## [25] "TO_LS_SSR" "DT_SSL" "TM_LS_M"
## [28] "TO_FS_SSL" "WASSER_SSL" "DT_FS"
## [31] "DT_LS" "TO_LS_SSL" "TU_LS_M"
df1 <- df1 %>%
dplyr::select(-C14)
df1$C15 <- with(df1, Class.15>0)
df1$C15<-factor(df1$C15, levels=c(FALSE,TRUE), labels=c("no.error", "error"))
prop.table(table(df1$C15))
##
## no.error error
## 0.992634477 0.007365523
As with the Class 14 errors, Class 15 errors occur extremely rarely in our data set. Only 0.74% of our observations contain a Class 15 error, while 99.26% are free of this error.
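With a class this rare, a train/test split needs to preserve the class proportions. For a factor outcome, `createDataPartition` does this by sampling within each level. The following base-R sketch illustrates that idea on synthetic data with the same 0.74% error rate. The helper `stratified_index` and the vector `y` are hypothetical and are not part of the original analysis.

```r
set.seed(1)

# Synthetic outcome: 148 errors among 20,000 observations (0.74%)
y <- factor(c(rep("no.error", 19852), rep("error", 148)),
            levels = c("no.error", "error"))

# Hypothetical helper: draw a share p of the row indices within each class
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(y, p = 0.75)

# Both splits keep the rare class at the same proportion
prop.table(table(y[train_idx]))
prop.table(table(y[-train_idx]))
```

Sampling within each class guarantees that the handful of error observations is divided between the two sets in the requested ratio, which a simple random split does not.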
index <- createDataPartition(df1$C15, p=0.75, list=FALSE)
trainSet <- df1[ index,]
testSet <- df1[-index,]
outcomeName<-'C15'
predictors <- predictors[!predictors %in% c("CoilID", "MAT_IDENT", "lTileID", "lnClass.4", "Class.4", "Class.14", "Class.15", "lnClass.14", "lnClass.15", "C4", "C14", "C15")]
C15.pred <- formula(paste("C15 ~ ", paste(predictors, collapse= " + ")))
output.tree.c15 <- partykit::ctree(C15.pred, data = trainSet)
png("anna_tks_tree_c15_2.png", res=80, height=800, width=1600)
plot(output.tree.c15)
dev.off()
## png
## 2
print(output.tree.c15)
##
## Model formula:
## C15 ~ POSITION_X + VORG_HAUPTAGGREGAT + TO_FS_SSL + TO_FS_M +
## TO_FS_SSR + TO_SSR_FS + TO_SSR_LS + TO_LS_SSR + TO_LS_M +
## TO_LS_SSL + TO_SSL_LS + TO_SSL_FS + TM_FS_SSL + TM_FS_M +
## TM_FS_SSR + TM_SSR_FS + TM_SSR_LS + TM_LS_SSR + TM_LS_M +
## TM_LS_SSL + TM_SSL_LS + TM_SSL_FS + TU_FS_SSL + TU_FS_M +
## TU_FS_SSR + TU_SSR_FS + TU_SSR_LS + TU_LS_SSR + TU_LS_M +
## TU_LS_SSL + TU_SSL_LS + TU_SSL_FS + DT_SSR + DT_SSL + DT_FS +
## DT_LS + VG + FUELLSTAND + STRANGBREITE + WASSER_SSR + WASSER_SSL +
## WASSER_FS + WASSER_LS + STOPFENSTELLUNG + PLATTENDICKE_SSL +
## PLATTENDICKE_SSR + ARGON_DRUCK_ST + ARGON_DURCHFL_ST + ARGON_DURCHFL_DUSCH +
## TUNDISH_POSITION + VERTEILERFUELLSTAND + NETTO_PFANNENINHALT +
## KONI_LINKS + ANST__VS_HG_3__IR__S + DICKE__AL__IR__S + DICKE__HA_1__IR__S +
## DICKE__HA_2__IR__S + DICKE__VB__IR__S + ENTZ__FS_ZW_F1__IR__S +
## ENTZ__FS_ZW_F2__IR__S + ENTZ__ZWR1_AL_SN2__IR__S + ENTZ__ZW_OF_AL__IR__S +
## KEIL25__FB__IR__S + KEIL40__FB__IR__S + PR_40__FB__IR__S +
## RISS__HA_AS__IR__S + RISS__HA_BS__IR__S + TEMP__FB_1__IR__S +
## TEMP__FB_3__IR__S + TEMP__HA_1__IR__S + TEMP__HA_4__IR__S +
## TEMP__HA_5__IR__S + TEMP__HA__SR__MAX + V__FS_G2__IR__S +
## V__FS_G3__IR__S + V__FS_G4__IR__S + V__FS_G5__IR__S + V__FS_G6__IR__S +
## V__FS_G7__IR__S + WK__FS_G7__IR__S + WK__VS_SP_3__IR__S +
## WSPALT__FS_G7__IR__S
##
## Fitted party:
## [1] root
## | [2] DICKE__VB__IR__S <= 1.36037
## | | [3] STRANGBREITE <= 2530
## | | | [4] VERTEILERFUELLSTAND <= 78.16667: no.error (n = 1934, err = 0.4%)
## | | | [5] VERTEILERFUELLSTAND > 78.16667
## | | | | [6] TU_FS_M <= 117.73333: no.error (n = 2398, err = 1.5%)
## | | | | [7] TU_FS_M > 117.73333
## | | | | | [8] TM_FS_M <= 117.3: no.error (n = 15, err = 33.3%)
## | | | | | [9] TM_FS_M > 117.3: no.error (n = 197, err = 5.1%)
## | | [10] STRANGBREITE > 2530: no.error (n = 15263, err = 0.5%)
## | [11] DICKE__VB__IR__S > 1.36037: no.error (n = 355, err = 3.1%)
##
## Number of inner nodes: 5
## Number of terminal nodes: 6
# The response here is the binary C15 factor, not the log-transformed count
plot(output.tree.c15,
     main = "Class 15 Error Tree (Binary Response)",
     gp = gpar(fontsize = 10),
     inner_panel = node_inner,
     ip_args = list(abbreviate = FALSE, id = FALSE)
)
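The above output shows the conditional tree grown for Class 15 errors. The node containing the highest number of errors is node 10, although only 0.5% of its 15,263 observations are Class 15 errors. The highest error rate is found in node 8, where 33.3% of the 15 observations are Class 15 errors. The variables chosen for inner nodes in this tree, listed without repetition, are DICKE__VB__IR__S, STRANGBREITE, VERTEILERFUELLSTAND, TU_FS_M, and TM_FS_M.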
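The node comparison can be checked from the printed tree alone. The sketch below transcribes the terminal nodes from the output above and converts the rounded `err` percentages into approximate error counts. Because the percentages are rounded to one decimal, the counts are estimates only. The data frame `nodes` is introduced here for illustration.

```r
# Terminal nodes transcribed from the printed tree (n and err in percent)
nodes <- data.frame(
  node    = c(4, 6, 8, 9, 10, 11),
  n       = c(1934, 2398, 15, 197, 15263, 355),
  err_pct = c(0.4, 1.5, 33.3, 5.1, 0.5, 3.1)
)

# Every terminal node predicts "no.error", so err is the share of errors in the node
nodes$approx_errors <- round(nodes$n * nodes$err_pct / 100)

most_errors  <- nodes$node[which.max(nodes$approx_errors)] # node holding the most errors
highest_rate <- nodes$node[which.max(nodes$err_pct)]       # node with the highest error rate
overall_rate <- sum(nodes$approx_errors) / sum(nodes$n)    # approximate training error rate

nodes
```

The approximate overall rate recovered this way is close to the 0.74% observed in the full data set, as one would expect from a stratified split.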
Again, regrowing the tree repeatedly showed notable variation in its structure, so three conditional random forest models, each with the tree count set to 1000, have been grown below.
# Subsample 10,000 training rows without replacement for the forests
trainSet2 <- trainSet[sample(seq_len(nrow(trainSet)), 10000, replace = FALSE), ]
# First conditional random forest for Class 15
output.rf.c15 <- party::cforest(C15.pred, data = trainSet2, control = cforest_control(ntree = 1000))
var.imp.c15 <- party::varimp(output.rf.c15)
png("anna_tks_rfc15varimp.1.png", res = 80, height = 800, width = 1600)
barchart(tail(sort(var.imp.c15), 40), xlab = "Variable Importance", main = "Variable Importance Class 15 Error")
dev.off() # close the PNG device so the file is written to disk
var.imp.c15.1 <- var.imp.c15
Above is the variable importance barchart for the first conditional random forest model for Class 15 errors. Immediately obvious is that, with the exception of STRANGBREITE, most variables chosen by the tree above do not feature among the 20 most important variables. This again stresses the variation in the model at the tree level. In order to cope with this variability, two more random forest models are grown below, on the same training split as the model above, and their most important variables are compared.
# Second conditional random forest for Class 15, grown on the same trainSet2
output.rf.c15 <- party::cforest(C15.pred, data = trainSet2, control = cforest_control(ntree = 1000))
var.imp.c15 <- party::varimp(output.rf.c15)
png("anna_tks_rfc15varimp.2.png", res = 80, height = 800, width = 1600)
barchart(tail(sort(var.imp.c15), 40), xlab = "Variable Importance", main = "Variable Importance Class 15 Error")
dev.off() # close the PNG device so the file is written to disk
var.imp.c15.2 <- var.imp.c15
# Third conditional random forest for Class 15, again on trainSet2
output.rf.c15 <- party::cforest(C15.pred, data = trainSet2, control = cforest_control(ntree = 1000))
var.imp.c15 <- party::varimp(output.rf.c15)
png("anna_tks_rfc15varimp.3.png", res = 80, height = 800, width = 1600)
barchart(tail(sort(var.imp.c15), 40), xlab = "Variable Importance", main = "Variable Importance Class 15 Error")
dev.off() # close the PNG device so the file is written to disk
var.imp.c15.3 <- var.imp.c15
Those variables which ranked in the top 20 most important for each model are listed as follows.
## [1] "TO_SSR_LS" "TM_FS_M" "TU_SSL_LS" "TO_SSL_FS" "TM_FS_SSR"
## [6] "KONI_LINKS" "STRANGBREITE" "TO_FS_M" "TU_FS_SSR"
The above variables are listed in order of increasing average importance across all three models. When considering the top 40 most important variables for all three models, we find that the variables listed below are shared by all three.
## [1] "TM_LS_SSR" "TO_FS_SSR" "PLATTENDICKE_SSR" "TU_LS_SSL"
## [5] "TU_LS_SSR" "TO_LS_SSL" "DT_LS" "TO_LS_SSR"
## [9] "TM_SSL_LS" "TU_FS_M" "DICKE__VB__IR__S" "TO_LS_M"
## [13] "TU_SSR_LS" "TO_SSR_LS" "TU_SSL_FS" "TM_FS_M"
## [17] "TU_SSL_LS" "TM_SSR_LS" "TO_SSL_FS" "TM_FS_SSR"
## [21] "KONI_LINKS" "STRANGBREITE" "TO_FS_M" "TU_FS_SSR"
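The code that produced these shared lists is not shown in the report. One way such lists could be derived is sketched below: take the top `k` names from each importance vector, intersect them, and order the result by increasing mean importance. The function `shared_top_k` and the toy vectors `imp1` to `imp3` are hypothetical stand-ins for `var.imp.c15.1`, `var.imp.c15.2`, and `var.imp.c15.3`.

```r
# Toy importance vectors standing in for the three varimp() results
imp1 <- c(A = 0.90, B = 0.50, C = 0.40, D = 0.10, E = 0.05)
imp2 <- c(A = 0.80, B = 0.20, C = 0.60, D = 0.02, E = 0.30)
imp3 <- c(A = 0.70, B = 0.45, C = 0.50, D = 0.30, E = 0.01)

# Hypothetical helper: variables in the top k of every model,
# ordered by increasing average importance across the models
shared_top_k <- function(imp_list, k) {
  top_sets <- lapply(imp_list, function(imp) names(tail(sort(imp), k)))
  shared <- Reduce(intersect, top_sets)
  mean_imp <- vapply(
    shared,
    function(v) mean(vapply(imp_list, function(imp) imp[[v]], numeric(1))),
    numeric(1)
  )
  names(sort(mean_imp))
}

shared <- shared_top_k(list(imp1, imp2, imp3), k = 3)
shared
```

Applied to the real importance vectors, the same call with `k = 20` or `k = 40` would give lists of the kind printed above, assuming the report used a plain intersection.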
df1 <- df1 %>%
dplyr::select(-C15)
As the errors all occur on the same sheets, it is possible that they at times occur simultaneously. With this in mind, it is useful to know which variables, if any, are important for classifying all three error classes.
Variables which were ranked as being in the top 20 most important variables for all three error class models are listed as follows.
## [1] "STRANGBREITE"
Variables ranked among the top 40 most important in the conditional random forest models of all three error classes are:
## [1] "DT_LS" "TU_FS_M" "TM_LS_SSR" "TO_FS_SSR" "TM_SSR_LS"
## [6] "TO_LS_SSL" "KONI_LINKS" "STRANGBREITE"
In the above analysis, the data produced during the earlier data reduction and treatment (steps 1-4, available on the Nextcloud server) is treated for possible redundancies caused by highly correlated variables and reviewed for any interesting patterns. The data is then used to create tree and forest models for classifying the presence or absence of three separate error types: Classes 4, 14, and 15.
First, Class 4 errors are transformed using a log(data + 1) transformation. This was done in an attempt to deal with the high level of skew in the data or, in other words, with the very low occurrence rate of errors. Class 4 errors were the only class to show notable change under the transformation, as this error type took multiple levels throughout the observations. When modeled with a linear regression model, 18 different variables were found to have a significant effect. However, as the model fit the data very poorly, these results were regarded as suspect. In order to obtain a model that fits the data better, conditional tree and conditional random forest models were grown.
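The effect of the transformation can be illustrated with a small sketch. It assumes, purely for illustration, one count variable with several levels and one that only takes the values 0 and 1. Both vectors are made up and do not come from the data set.

```r
# Hypothetical error counts: one multi-level, one taking only 0 and 1
counts_multi  <- c(0, 0, 0, 1, 2, 5, 12, 40)
counts_binary <- c(0, 0, 0, 0, 0, 1, 0, 1)

# log(data + 1); log1p() is the numerically stable base-R equivalent
log_multi  <- log1p(counts_multi)
log_binary <- log1p(counts_binary)

# The multi-level counts are compressed: the largest value shrinks from 40 to about 3.7
range(log_multi)

# A 0/1 variable is only rescaled to 0 and log(2), so its shape does not change
sort(unique(log_binary))

# Binary presence/absence response, built the same way as C15 in the analysis
presence <- factor(counts_multi > 0, levels = c(FALSE, TRUE),
                   labels = c("no.error", "error"))
table(presence)
```

Under this assumption, the transformation only helps when the counts spread over several levels, which is consistent with the report modeling the other classes as binary responses.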
For the first tree, the log transformation of the Class 4 error was again used as the response variable of the model. Although the tree was highly volatile under different seeds, the variables marked as significant were:
A binary transformation was then created for the Class 4 error variable, which simply marked the presence or absence of an error. This was then provided to a new conditional tree. The tree grown with a binary response generated the following significant variables.
Again, under different seeds this tree’s structure was found to be highly variable. It is worth noting that these two trees already share some variables of significance. Both models utilize the following variables at various inner nodes.
Finally, to deal with the variability in the tree structure, three conditional random forest models were created. The first version of this analysis grew forests with only the default 500 trees, but such models were still highly volatile under different seeds, so the tree count was increased first to 800 and finally to the 1000 trees used in all conditional forest models above. The `mtry` parameter, which controls the number of variables considered at each split in each tree of the forest, was also considered for tuning. As the party package has no built-in function for tuning this parameter, it was left at the default of 5 to avoid the possibility of overfitting. All three forests were grown from the same training split, to allow for comparability of output.
Each conditional forest produced slightly different variable importance results, due to the inherent variability in the creation of the forest model. That being said, the three models produced above still marked a notable number of the same variables as being of comparable importance. Of the top 20 most important variables in each forest, all three models included the following variables. Note that these are listed in order of increasing average importance across all three models, such that the most important variable on average for the three models was VG.
C4Top20
## [1] "TM_FS_SSL" "DT_FS" "TU_FS_M" "DT_SSL"
## [5] "TM_LS_SSR" "TM_LS_SSL" "V__FS_G4__IR__S" "TO_SSL_LS"
## [9] "TO_LS_SSL" "TM_SSR_FS" "KONI_LINKS" "STRANGBREITE"
## [13] "TO_SSL_FS" "WASSER_SSR" "WASSER_SSL" "VG"
In this list we see only one variable which is shared by the two conditional trees: VG.
Of the top 40 variables marked as most important in each model, the three models shared the following.
C4Top40
## [1] "TU_FS_SSR" "TM_LS_M" "TO_FS_SSL" "POSITION_X"
## [5] "TM_FS_SSR" "PLATTENDICKE_SSL" "PLATTENDICKE_SSR" "DT_LS"
## [9] "TU_SSL_FS" "TM_SSL_FS" "TM_FS_SSL" "TO_SSR_FS"
## [13] "DT_FS" "TU_FS_M" "DT_SSL" "TM_LS_SSR"
## [17] "TM_LS_SSL" "TO_FS_SSR" "TM_SSR_LS" "V__FS_G4__IR__S"
## [21] "TO_SSL_LS" "TO_LS_SSL" "TM_SSR_FS" "KONI_LINKS"
## [25] "STRANGBREITE" "TO_SSL_FS" "WASSER_SSR" "WASSER_SSL"
## [29] "VG"
As the log transform for Class 14 errors did not appear to have a useful effect on the data set, only the binary transformation was applied and modeled. The tree created for the binary Class 14 error variable was again highly volatile. The tree grown under the seed used here selected only one variable, DT_FS, as significant for classifying the response, and each resulting node contained less than 1% errors. Again, the tree count of the random forest model was raised to 1000, and three conditional random forests were grown in order to compare their results for volatility.
Despite the expected variability in output, all three models still shared a notable group of variables that ranked among the 20 most important in each individual model. They are listed as follows, in order of increasing average importance for all three models.
C14Top20
## [1] "TM_LS_SSL" "STRANGBREITE" "TU_FS_M" "TO_LS_SSR" "TM_LS_M"
## [6] "WASSER_SSL" "TO_LS_SSL" "TU_LS_M"
As one can see, the variable marked as significant in the original tree is not among the variables shared by the top 20 of all three models. DT_FS is, however, among the variables shared within the top 40 most important variables, the list of which follows below.
C14Top40
## [1] "PLATTENDICKE_SSL" "TO_SSL_LS" "TM_FS_M"
## [4] "TM_SSL_LS" "TU_LS_SSR" "TM_SSR_LS"
## [7] "TU_SSR_FS" "DT_SSR" "TM_LS_SSR"
## [10] "TM_SSR_FS" "TO_SSR_FS" "TO_FS_SSR"
## [13] "VERTEILERFUELLSTAND" "WASSER_SSR" "ARGON_DRUCK_ST"
## [16] "VG" "TM_LS_SSL" "TU_SSL_LS"
## [19] "STRANGBREITE" "TU_FS_M" "TM_SSL_FS"
## [22] "TU_SSR_LS" "KONI_LINKS" "TO_LS_M"
## [25] "TO_LS_SSR" "DT_SSL" "TM_LS_M"
## [28] "TO_FS_SSL" "WASSER_SSL" "DT_FS"
## [31] "DT_LS" "TO_LS_SSL" "TU_LS_M"
Again, the log transformation of the Class 15 errors was not useful for dealing with the poor spread of the data, so only the binary response was modeled for this class.
The conditional tree grown under varying seeds was observed to be highly volatile, again due to the low occurrence rate of Class 15 errors and the relatively high number of variables in the data set. Still, the tree printed above selected five different variables at its inner nodes as being significant for classifying Class 15 errors. Node 10 of this tree holds the largest number of errors, although only 0.5% of its 15,263 observations are errors, in line with the roughly 0.74% error rate in the data set as a whole.
To deal with the large amount of volatility seen in the tree model, three separate random forest models were grown on the same training set, and their results were compared. The variables shared by the top 20 most important variables of all three models are listed below, in order of increasing average importance for the three models.
C15Top20
## [1] "TO_SSR_LS" "TM_FS_M" "TU_SSL_LS" "TO_SSL_FS" "TM_FS_SSR"
## [6] "KONI_LINKS" "STRANGBREITE" "TO_FS_M" "TU_FS_SSR"
The variables shared by all three models within their top 40 most important variables follow below.
C15Top40
## [1] "TM_LS_SSR" "TO_FS_SSR" "PLATTENDICKE_SSR" "TU_LS_SSL"
## [5] "TU_LS_SSR" "TO_LS_SSL" "DT_LS" "TO_LS_SSR"
## [9] "TM_SSL_LS" "TU_FS_M" "DICKE__VB__IR__S" "TO_LS_M"
## [13] "TU_SSR_LS" "TO_SSR_LS" "TU_SSL_FS" "TM_FS_M"
## [17] "TU_SSL_LS" "TM_SSR_LS" "TO_SSL_FS" "TM_FS_SSR"
## [21] "KONI_LINKS" "STRANGBREITE" "TO_FS_M" "TU_FS_SSR"
Finally, the variables shared by all three error class model sets as being in the top 20 most important variables are listed in increasing order of importance, as follows.
AllTop20
## [1] "STRANGBREITE"
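This result can be checked directly from the three top-20 lists printed above. The sketch below transcribes `C4Top20`, `C14Top20`, and `C15Top20` from the output and intersects them. It assumes that `AllTop20` was built as a plain intersection, which the report does not show explicitly.

```r
# Shared top-20 variables per error class, transcribed from the output above
C4Top20 <- c("TM_FS_SSL", "DT_FS", "TU_FS_M", "DT_SSL",
             "TM_LS_SSR", "TM_LS_SSL", "V__FS_G4__IR__S", "TO_SSL_LS",
             "TO_LS_SSL", "TM_SSR_FS", "KONI_LINKS", "STRANGBREITE",
             "TO_SSL_FS", "WASSER_SSR", "WASSER_SSL", "VG")
C14Top20 <- c("TM_LS_SSL", "STRANGBREITE", "TU_FS_M", "TO_LS_SSR", "TM_LS_M",
              "WASSER_SSL", "TO_LS_SSL", "TU_LS_M")
C15Top20 <- c("TO_SSR_LS", "TM_FS_M", "TU_SSL_LS", "TO_SSL_FS", "TM_FS_SSR",
              "KONI_LINKS", "STRANGBREITE", "TO_FS_M", "TU_FS_SSR")

# Variables present in all three lists
all_top20 <- Reduce(intersect, list(C4Top20, C14Top20, C15Top20))
all_top20

# Pairwise overlaps show how much of the agreement is lost with the third class
intersect(C4Top20, C14Top20)
intersect(C4Top20, C15Top20)
```

The pairwise overlaps are larger than the three-way overlap, so STRANGBREITE is the only variable that all three error classes agree on at this cutoff.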
Hothorn, Torsten, et al. “Ctree: Conditional Inference Trees.” Ctree: Conditional Inference Trees, Cran, cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf.
Jackson, Simon. “Exploring Correlations in R with Corrr.” BlogR on Svbtle, 21 Aug. 2018, drsimonj.svbtle.com/exploring-correlations-in-r-with-corrr.
Zhu, Hao. “Package ‘KableExtra.’” KableExtra.pdf, Cran, 22 Jan. 2019, cran.r-project.org/web/packages/kableExtra/kableExtra.pdf.